Scrapy Crawler only pulls 19 of 680+ urls - python

I'm trying to scrape this page: https://coinmarketcap.com/currencies/views/all/
In td[2] of every row there is a link. I am trying to ask Scrapy to go to each link in that td and scrape the page that link represents. Below is my code:
Note: another person was awesome in helping me get this far.
class ToScrapeSpiderXPath(CrawlSpider):
    name = 'coinmarketcap'
    start_urls = [
        'https://coinmarketcap.com/currencies/views/all/'
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//td[2]/a',)), callback="parse", follow=True),
    )

    def parse(self, response):
        BTC = BTCItem()
        BTC['source'] = str(response.request.url).split("/")[2]
        BTC['asset'] = str(response.request.url).split("/")[4],
        BTC['asset_price'] = response.xpath('//*[@id="quote_price"]/text()').extract(),
        BTC['asset_price_change'] = response.xpath(
            '/html/body/div[2]/div/div[1]/div[3]/div[2]/span[2]/text()').extract(),
        BTC['BTC_price'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[1]/text()').extract(),
        BTC['Prct_change'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[2]/text()').extract()
        yield (BTC)
Even though the table contains 600+ links/pages, when I run scrapy crawl coinmarketcap I only get 19 records, which means only 19 pages out of this list of 600+. I'm failing to see what is stopping the scrape. Any help would be greatly appreciated.
Thanks

Your spider goes too deep: with that rule it also finds and follows links on the individual coin pages. You can roughly fix the problem by adding DEPTH_LIMIT = 1, but you can surely find a more elegant solution.
Here is the code that works for me (there are other minor adjustments too):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# BTCItem is defined in the project's items module

class ToScrapeSpiderXPath(CrawlSpider):
    name = 'coinmarketcap'
    start_urls = [
        'https://coinmarketcap.com/currencies/views/all/'
    ]

    custom_settings = {
        'DEPTH_LIMIT': '1',
    }

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//td[2]',)), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        BTC = BTCItem()
        BTC['source'] = str(response.request.url).split("/")[2]
        BTC['asset'] = str(response.request.url).split("/")[4]
        BTC['asset_price'] = response.xpath('//*[@id="quote_price"]/text()').extract()
        BTC['asset_price_change'] = response.xpath(
            '/html/body/div[2]/div/div[1]/div[3]/div[2]/span[2]/text()').extract()
        BTC['BTC_price'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[1]/text()').extract()
        BTC['Prct_change'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[2]/text()').extract()
        yield BTC
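If you would rather avoid DEPTH_LIMIT altogether, a more targeted rule is one option. The sketch below is only an illustration and assumes the coin detail pages live under /currencies/<slug>/:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ToScrapeSpiderXPath(CrawlSpider):
    name = 'coinmarketcap'
    start_urls = [
        'https://coinmarketcap.com/currencies/views/all/'
    ]

    rules = (
        # Extract only the coin links from the table's second column and stop there;
        # the allow pattern is an assumption about the site's URL layout.
        Rule(
            LinkExtractor(allow=r'/currencies/[^/]+/$',
                          restrict_xpaths='//td[2]/a'),
            callback='parse_item',
            follow=False,
        ),
    )

    def parse_item(self, response):
        # same extraction code as in the answer above
        pass

With the extractor constrained to the listing table and follow=False, the spider never follows links out of the coin pages, so no depth limit is needed.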


Scraping links with Scrapy

I am trying to scrape a Swedish real estate website, www.booli.se. However, I can't figure out how to follow the links for each house and extract, for example, price, rooms, age, etc. I only know how to scrape one page and I can't seem to wrap my head around this. I am looking to do something like:
for link in website:
    follow link
    attribute1 = item.css('cssobject::text').extract()[1]
    attribute2 = item.css('cssobject::text').extract()[2]
    yield {'Attribute 1': attribute1, 'Attribute 2': attribute2}
So that I can scrape the data and output it to an Excel file. My code for scraping a single page without following links is as follows:
import scrapy

class BooliSpider(scrapy.Spider):
    name = "boolidata"
    start_urls = [
        'https://www.booli.se/slutpriser/lund/116978/'
    ]

    '''def parse(self, response):
        for link in response.css('.nav-list a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link),
                                 callback=self.collect_data)'''

    def parse(self, response):
        for item in response.css('li.search-list__item'):
            size = item.css('span.search-list__row::text').extract()[1]
            price = item.css('span.search-list__row::text').extract()[3]
            m2price = item.css('span.search-list__row::text').extract()[4]
            yield {'Size': size, 'Price': price, 'M2price': m2price}
Thankful for any help. I'm really having trouble getting it all together and outputting the contents of specific links to a cohesive output file (Excel).
You could use Scrapy's CrawlSpider for following and scraping links.
Your code should look like this:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BooliItem(scrapy.Item):
    size = scrapy.Field()
    price = scrapy.Field()
    m2price = scrapy.Field()

class BooliSpider(CrawlSpider):
    name = "boolidata"
    start_urls = [
        'https://www.booli.se/slutpriser/lund/116978/',
    ]

    rules = [
        Rule(
            LinkExtractor(
                allow=(r'listing url pattern here to follow'),
                deny=(r'other url patterns to deny'),
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        item = BooliItem()
        item['size'] = response.css('size selector').extract()
        item['price'] = response.css('price selector').extract()
        item['m2price'] = response.css('m2price selector').extract()
        return item
And you can run your spider via:
scrapy crawl boolidata -o booli.csv
and import the CSV into Excel.
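If you need a real .xlsx file rather than a CSV you open in Excel, one option (not part of this answer's workflow, just a suggestion) is to convert the exported CSV with pandas; this assumes pandas and an Excel writer such as openpyxl are installed:

import pandas as pd

# Convert the CSV produced by `scrapy crawl boolidata -o booli.csv`
# into an Excel workbook.
df = pd.read_csv('booli.csv')
df.to_excel('booli.xlsx', index=False)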

Scrapy spider prefers one domain (which slows down process)

I'm working on a crawler which gets a list of domains and, for every domain, counts the number of URLs on it.
I use CrawlSpider for this purpose, but there is a problem.
When I start crawling, it seems to send multiple requests to multiple domains, but after some time (one minute) it ends up crawling only one page (domain).
SETTINGS
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 3
REACTOR_THREADPOOL_MAXSIZE = 20
Here you can see how many URLs have been scraped per domain (screenshot omitted): after 7 minutes the spider targets only the first domain and has forgotten about the others.
If Scrapy targets just one domain at a time, it logically slows down the process. I would like to send requests to multiple domains within a short time.
class MainSpider(CrawlSpider):
    name = 'main_spider'
    allowed_domains = []

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        for d in Domain.objects.all():
            self.allowed_domains.append(d.name)
            yield scrapy.Request(d.main_url, callback=self.parse_item, meta={'domain': d})

    def parse_start_url(self, response):
        self.parse_item(response)

    def parse_item(self, response):
        d = response.meta['domain']
        d.number_of_urls = d.number_of_urls + 1
        d.save()

        extractor = LinkExtractor(allow_domains=d.name)
        links = extractor.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_item, meta={'domain': d})
It seems to focus only on the first domain until it has scraped it completely.
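One configuration worth experimenting with here (my own suggestion, not something from the original post) is Scrapy's breadth-first order: with FIFO queues the scheduler works through requests in arrival order, which tends to rotate across the queued domains instead of drilling into the first one.

custom_settings = {
    # Breadth-first crawl order, as described in the Scrapy docs.
    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
}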

Scrapy CSS selector

I am learning how to use Scrapy but I am having some issues. I wrote this code, following an online tutorial, to understand a bit more about it.
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            PRICE_SELECTOR = './/dl[dt/text() = "RRP"]/dd[3]/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'retail price': brickset.xpath(PRICE_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
Since the site divides the listed products by year and this code crawls only the data from 2016, I decided to extend it and analyze the data of previous years as well. The idea of the code is this:
PREVIOUS_YEAR_SELECTOR = '...'
previous_year = response.css(PREVIOUS_YEAR_SELECTOR).extract_first()
if previous_year:
    yield scrapy.Request(
        response.urljoin(previous_year),
        callback=self.parse
    )
I tried different things but I really have no idea what to write instead of '...'.
I also tried with XPath, but nothing seems to work.
You have at least two options here.
The first is to use the generic CrawlSpider and define which links you want to extract and follow. Something like this:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BrickSetSpider(CrawlSpider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets']

    rules = (
        Rule(LinkExtractor(
            allow=r'\/year\-[\d]{4}'), callback='parse_bricks', follow=True),
    )

    # Your method renamed to parse_bricks goes here
Note: you need to rename the parse method to some other name like 'parse_bricks', since CrawlSpider uses the parse method itself.
The second option is to set start_urls to the page http://brickset.com/browse/sets, which contains links to all the year sets, and add a method to parse those links:
class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/browse/sets']

    def parse(self, response):
        links = response.xpath(
            '//a[contains(@href, "/sets/year")]/@href').extract()
        for link in links:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_bricks)

    # Your method renamed to parse_bricks goes here
Maybe you want to exploit the structure of the href attribute? It seems to follow the pattern /sets/year-YYYY. With that, you can use a regex-based selector or, if you are lazy like me, just a contains():
XPath: //a[contains(@href, "/sets/year-")]/@href
I'm not sure if this is also possible with CSS. So the ... can be filled with:
PREVIOUS_YEAR_SELECTOR_XPATH = '//a[contains(@href, "/sets/year-")]/@href'
previous_year = response.xpath(PREVIOUS_YEAR_SELECTOR_XPATH).extract_first()
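As an aside, the same filter can also be written in CSS with an attribute-substring selector; a small sketch using Scrapy's ::attr() syntax:

PREVIOUS_YEAR_SELECTOR_CSS = 'a[href*="/sets/year-"]::attr(href)'
previous_year = response.css(PREVIOUS_YEAR_SELECTOR_CSS).extract_first()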
But I think you will go for ALL years, so maybe you want to loop over the links:
PREVIOUS_YEAR_SELECTOR_XPATH = '//a[contains(@href, "/sets/year-")]/@href'
for previous_year in response.xpath(PREVIOUS_YEAR_SELECTOR_XPATH).extract():
    yield scrapy.Request(response.urljoin(previous_year), callback=self.parse)
I think you are on a good path. Google for a CSS/XPath cheat sheet that matches your needs and check out the FirePath extension or similar. It speeds up selector setup a lot :)

How to feed a spider with links crawled within the spider?

I'm writing a spider (CrawlSpider) for an online store. According to the client's requirements, I need to write two rules: one for determining which pages have items and another for extracting the items.
I have both rules already working independently:
If my start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] and I comment out the Rule and the code of parse_category, my parse_item will extract every item.
On the other hand, if start_urls = "http://www.example.com" and I comment out the Rule and the code of parse_item, parse_category will return every link on which there are items for extracting, i.e. parse_category will return www.example.com/books.php and www.example.com/movies.php.
My problem is that I don't know how to merge both modules, so that start_urls = "http://www.example.com", parse_category extracts www.example.com/books.php and www.example.com/movies.php, and those links are fed to parse_item, where I actually extract the info of each item.
I need to do it this way instead of just using start_urls = ["www.example.com/books.php", "www.example.com/movies.php"], because if a new category is added in the future (e.g. www.example.com/music.php), the spider wouldn't be able to detect that new category automatically and would have to be manually edited. Not a big deal, but the client doesn't want this.
class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"]

    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_category'),
        Rule(LinkExtractor(), follow=False, callback="parse_item"),
    )

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just other stuff
        if is a category:
            category['name'] = name
            category['url'] = response.url
        return category

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
The CrawlSpider rules don't work the way you want; you'll need to implement the logic yourself. When you specify follow=True you can't use a callback, because the idea is to keep extracting links (not items) while following the rules; check the documentation.
You could try something like this:
class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # no rules

    def parse(self, response):  # this is parse_category
        category_le = LinkExtractor("something for categories")
        for a in category_le.extract_links(response):
            yield Request(a.url, callback=self.parse_category)
        item_le = LinkExtractor("something for items")
        for a in item_le.extract_links(response):
            yield Request(a.url, callback=self.parse_item)

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just other stuff
        if is a category:
            category['name'] = name
            category['url'] = response.url
            yield category
        for req in self.parse(response):
            yield req

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
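For illustration only, with the example URLs from the question the two link extractors might be filled in like this (both patterns are hypothetical):

# Hypothetical patterns; adjust to the store's real URL layout.
category_le = LinkExtractor(allow=r'/(books|movies)\.php$')
item_le = LinkExtractor(allow=r'/item/\d+')  # assumed item URL pattern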
Instead of using a parse_category, I used restrict_css in the LinkExtractor to get the links I want, and it seems to feed the second Rule with the extracted links, so my question is answered. It ended up this way:
class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        Rule(LinkExtractor(restrict_css=("#movies", "#books"))),
        Rule(LinkExtractor(), callback="parse_item"),
    )

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
It still can't detect newly added categories (and there is no clear pattern to use in restrict_css without fetching other garbage), but at least it complies with the client's requirements: two rules, one for extracting the category links and another for extracting the items' data.

Scrapy spider get information that is inside of links

I have written a spider that can take the information from this page and follow the "Next page" links. Right now the spider only takes the information that I'm showing in the following structure.
The structure of the page is something like this:
Title 1
URL 1 ---------> If you click you go to one page with more information
Location 1
Title 2
URL 2 ---------> If you click you go to one page with more information
Location 2
Next page
What I want is for the spider to go to each URL link and get the full information. I suppose I must generate another rule specifying that I want to do something like this.
The behaviour of the spider should be:
Go to URL1 (get info)
Go to URL2 (get info)
...
Next page
But I don't know how I can implement it. Can someone guide me?
Code of my Spider:
class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        sites = sel.xpath("//div[@id='llista-resultats']/div")
        items = []
        cont = 0
        for site in sites:
            item = BcnItem()
            item['id'] = cont
            item['title'] = u''.join(site.xpath('h3/a/text()').extract())
            item['url'] = u''.join(site.xpath('h3/a/@href').extract())
            item['when'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[1]/text()').extract())
            item['where'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[2]/span/a/text()').extract())
            item['street'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[3]/span/text()').extract())
            item['phone'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[4]/text()').extract())
            items.append(item)
            cont = cont + 1
        return items
EDIT: After searching on the internet I found some code with which I can do that.
First of all, I have to get all the links, then I have to call another parse method.
def parse(self, response):
    # Get all URLs
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
If you want to use Rules because the page has a paginator, you should change def parse to def parse_start_url and then call this method through the Rule. With these changes you make sure that parsing begins at parse_start_url, and the code would be something like this:
rules = (
    Rule(
        SgmlLinkExtractor(
            allow=(re.escape("index.php")),
            restrict_xpaths=("//div[@class='paginador']")),
        callback="parse_start_url",
        follow=True),
)

def parse_start_url(self, response):
    # Get all URLs
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
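To fill in the placeholder, parse_start_url could reuse the listing XPath that parse_item already uses to collect each result's link (a sketch, not tested against the live site):

from urlparse import urljoin  # Python 2, matching the era of this code

def parse_start_url(self, response):
    sel = Selector(response)
    # Collect the link of every result on the current listing page.
    for href in sel.xpath("//div[@id='llista-resultats']/div/h3/a/@href").extract():
        yield Request(url=urljoin(response.url, href), callback=self.parse_details)

def parse_details(self, response):
    # Extract the detailed information for each event page here.
    pass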
That's all, folks.
There is an easier way of achieving this. Click "next" on your page and read the new URL carefully:
http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10
By looking at the GET data in the URL (everything after the question mark), and with a bit of testing, we find that these mean:
from=10 - Starting index
q=*:* - Search query
nr=10 - Number of items to display
This is how I would have done it:
Set nr=100 or higher (1000 may do as well; just be sure there is no timeout).
Loop from from=0 to 34300. This is above the current number of entries. You may want to extract this value first.
Example code:
entries = 34246
step = 100
stop = entries - entries % step + step

for x in xrange(0, stop, step):
    url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
    # Loop over all entries, and open links if needed
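Dropped into a spider's start_requests, that loop could look like this (a sketch; the spider name and the parse_page callback are made up for illustration):

import scrapy

class BcnListSpider(scrapy.Spider):
    name = 'bcn_list'  # hypothetical name for this sketch

    def start_requests(self):
        entries = 34246
        step = 100
        stop = entries - entries % step + step
        for x in xrange(0, stop, step):
            url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Loop over all entries on this result page and open the detail links if needed.
        pass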
