Using Python Scrapy to extract links from a webpage

I am a beginner with Python and am using Scrapy to extract links from the following webpage:
http://www.basketball-reference.com/leagues/NBA_2015_games.html
The code that I have written is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem
class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com/']
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
    rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]
    def parse_item(self, response):
        item = BasketballItem()
        item['url'] = response.url
        return item
I run this code through the command prompt, but the file created does not have any links. Could someone please help?

The rule cannot find the links because of the regular expression: a ^ in the middle of the pattern can never match. Fix the regular expression in the rule:
rules = [
    Rule(LinkExtractor(allow=r'boxscores/\w+'), callback='parse_item')
]
Keep the callback set explicitly. A Rule without a callback only follows the extracted links and never calls parse_item.
Also, allow can be given as a single string instead of a list.
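To see why the original pattern never matched, here is a minimal sketch that uses only the standard re module. It assumes the link extractor applies each allow pattern to the URL as a search, and the box score URL is a hypothetical example of the link format.

```python
import re

# Hypothetical box score URL, used only for illustration.
url = "http://www.basketball-reference.com/boxscores/201410280LAL.html"

# Pattern from the question: '^' sits in the middle, so it can never match.
original = r"http://www.basketball-reference.com/boxscores/^\w+$"
# Pattern from the answer: no anchors, just the path fragment.
fixed = r"boxscores/\w+"

original_match = re.search(original, url)
fixed_match = re.search(fixed, url)

print(original_match)        # None
print(fixed_match.group())   # boxscores/201410280LAL
```

Testing a pattern against one or two real links like this is a quick check before running the whole crawl.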

Related

Scrapy save every link whole domain

Introduction
I am currently working on a crawler that saves every link of a domain to a .csv file.
Problem
In my console I can see which links it is following, but my items are still empty.
I get something like:
Here is my code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import LinkextractorItem
class TopArtSpider(CrawlSpider):
    name = "topart"
    start_urls = [
        'https://www.topart-online.com/de/Bambus-Kunstbaeume/l-KAT11'
    ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['Link']}
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        items = LinkextractorItem()
        link = response.xpath('a/@href')
        items['Link'] = link
        yield items
My start_url is just one category of the domain, because I don't want to wait too long while I'm still trying to build the correct spider.
The XPath selector isn't searching the entire DOM. Change it to this:
link = response.xpath('//a/@href')
The leading // searches the entire document rather than only the children of the root.
You are also not extracting the data: the selector returns Selector objects, so call getall() to get a list of strings. Alternatively, loop over the links and yield one item per link, which is probably the better approach here:
link = response.xpath('//a/@href')
for a in link:
    items = LinkextractorItem()  # fresh item for each link
    items['Link'] = a.get()
    yield items
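The difference between a relative path and a descendant search can be tried without Scrapy. The sketch below uses the standard library's ElementTree as a stand-in, which only supports a subset of XPath, so './/a' plays the role of '//a' and the href is read with get(). The page content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div><a href="/de/first">First</a></div>
    <p><a href="/de/second">Second</a></p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# 'a' only looks at direct children of the context node (<html>).
direct = [a.get("href") for a in root.findall("a")]

# './/a' walks every descendant, like '//a' in a full XPath engine.
everywhere = [a.get("href") for a in root.findall(".//a")]

print(direct)      # []
print(everywhere)  # ['/de/first', '/de/second']
```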

Python: why is my Scrapy CrawlSpider not printing or doing anything?

I'm new to Scrapy and can't get it to do anything. Eventually I want to scrape all the HTML comments from a website by following internal links.
For now I'm just trying to scrape the internal links and add them to a list.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class comment_spider(CrawlSpider):
    name = 'test'
    allowed_domains = ['https://www.andnowuknow.com/']
    start_urls = ["https://www.andnowuknow.com/"]
    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)
    def parse_start_url(self, response):
        return self.parse_item(response)
    def parse_item(self, response):
        urls = []
        for link in LinkExtractor(allow=(),).extract_links(response):
            urls.append(link)
        print(urls)
I'm just trying to get it to print something at this point, and nothing I've tried so far works.
It finishes with an exit code of 0 but doesn't print anything, so I can't tell what's happening.
What am I missing?
Your log messages should give some hints, but I see that your allowed_domains contains a URL instead of a domain. You should set it like this:
allowed_domains = ["andnowuknow.com"]
(See it in the official documentation)
Hope it helps.
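To illustrate why a full URL in allowed_domains filters out every link, here is a simplified stand-in for an offsite check. The is_allowed helper is hypothetical and is not Scrapy's implementation. It only compares host names, which is the idea behind the setting.

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """Hypothetical helper: accept a URL if its host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

link = "https://www.andnowuknow.com/some-page"  # hypothetical internal link

with_url = is_allowed(link, ["https://www.andnowuknow.com/"])
with_domain = is_allowed(link, ["andnowuknow.com"])

print(with_url)     # False: a host name never equals a full URL
print(with_domain)  # True
```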

scrapy rules do not call parsing method

I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of every URL matching the pattern /example/.*
Crawling works, but scraping the title does not: the output file is empty. Most likely I got the rules wrong. Is this the right syntax for the rules to achieve what I am looking for?
items.py
import scrapy
class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
spider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bid.items import BidItem
class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['domain.de']
    start_urls = ['https://www.domain.de/']
    rules = (
        Rule(
            LinkExtractor(),
            follow=True
        ),
        Rule(
            LinkExtractor(allow=['example/.*']),
            callback='parse_item'
        ),
    )
    def parse_item(self, response):
        href = BidItem()
        href['url'] = response.url
        href['title'] = response.css("h1::text").extract()
        return href
crawl: scrapy crawl getbid -o 012916.csv
From the CrawlSpider docs:
If multiple rules match the same link, the first one will be used,
according to the order they’re defined in this attribute.
Since your first rule will match all links, it will always be used and all other rules will be ignored.
Fixing the problem is as simple as switching the order of the rules.
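The first-match-wins behaviour from the quoted docs can be modelled in a few lines. This is a toy sketch, not Scrapy code: each rule is reduced to a (pattern, callback name) pair, an empty pattern stands in for LinkExtractor() with no arguments, and first_matching_callback is a hypothetical helper.

```python
import re

def first_matching_callback(url, rules):
    """Return the callback of the first rule whose pattern matches the URL."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None

url = "https://www.domain.de/example/123"  # hypothetical page URL

# Order from the question: the catch-all rule comes first and has no callback.
original = [(r"", None), (r"example/.*", "parse_item")]
# Order from the answer: the specific rule comes first.
swapped = [(r"example/.*", "parse_item"), (r"", None)]

print(first_matching_callback(url, original))  # None
print(first_matching_callback(url, swapped))   # parse_item
```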

CrawlSpider can't parse multipage in Scrapy

The CrawlSpider I've created is not doing its job properly. It parses the first page and then stops without going on to the next page. I'm doing something wrong but can't detect what. I hope somebody can give me a hint about what I should do to fix it.
"items.py" includes:
from scrapy.item import Item, Field
class CraigslistScraperItem(Item):
    Name = Field()
    Link = Field()
The CrawlSpider is in "craigs.py", which contains:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from craigslist_scraper.items import CraigslistScraperItem
class CraigsPySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo/',
    )
    rules = (Rule(LinkExtractor(allow=('sfbay\.craigslist\.org\/search\/npo/.*',),
                                restrict_xpaths=('//a[@class="button next"]')),
                  callback='parse', follow=True),)
    def parse(self, response):
        page = response.xpath('//p[@class="result-info"]')
        items = []
        for title in page:
            item = CraigslistScraperItem()
            item["Name"] = title.xpath('.//a[@class="result-title hdrlnk"]/text()').extract()
            item["Link"] = title.xpath('.//a[@class="result-title hdrlnk"]/@href').extract()
            items.append(item)
        return items
And finally the command I'm using to get CSV output is:
scrapy crawl craigs -o items.csv -t csv
By the way, I tried to use "parse_item" in the first place but got no response, which is why I used the "parse" method instead. Thanks in advance.
Don't name your callback method parse when you use CrawlSpider.
From Scrapy documentation:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
Also, you don't need to append each item to a list: since you are already using Scrapy Items, you can simply yield each item.
This code should work:
# -*- coding: utf-8 -*-
# scrapy.spiders replaces the old scrapy.contrib.spiders path,
# matching the scrapy.linkextractors import below.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from craigslist_scraper.items import CraigslistScraperItem
class CraigsPySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo/',
    )
    rules = (
        # Follow the pagination links and parse each result page.
        Rule(LinkExtractor(allow=(r'/search/npo\?s=.*',)), callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        page = response.xpath('//p[@class="result-info"]')
        for title in page:
            item = CraigslistScraperItem()
            item["Name"] = title.xpath('.//a[@class="result-title hdrlnk"]/text()').extract_first()
            item["Link"] = title.xpath('.//a[@class="result-title hdrlnk"]/@href').extract_first()
            yield item
Finally for output in csv format run: scrapy crawl craigs -o items.csv
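As a rough illustration of the quoted warning, here is a toy model in plain Python. The classes are hypothetical and are not Scrapy's source. They only mimic the situation the docs describe, where the base class's parse method is what applies the rules, so a subclass that defines its own parse silently replaces that logic.

```python
class ToyCrawlSpider:
    """Toy base class: parse() is where the link-following rules are applied."""
    def parse(self, response):
        return ["follow:" + link for link in response["links"]]

class BrokenSpider(ToyCrawlSpider):
    # Overriding parse() replaces the rule handling, so nothing is followed.
    def parse(self, response):
        return ["item:" + response["url"]]

class FixedSpider(ToyCrawlSpider):
    # A differently named callback leaves the base parse() untouched.
    def parse_item(self, response):
        return ["item:" + response["url"]]

# Hypothetical response, reduced to a dict for the sketch.
response = {"url": "/search/npo/", "links": ["/search/npo?s=100"]}

broken_result = BrokenSpider().parse(response)
fixed_result = FixedSpider().parse(response)

print(broken_result)  # ['item:/search/npo/']
print(fixed_result)   # ['follow:/search/npo?s=100']
```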

I can't get the data using Rules on scrapy

I'm writing a spider with Scrapy that works if I don't implement any rules, but now I'm trying to add a Rule that follows the paginator and scrapes all the remaining pages. I don't know why I can't get it to work.
Spider code:
allowed_domains = ['guia.bcn.cat']
start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
rules = (
    Rule(SgmlLinkExtractor(allow=("index.php?pg=search&from=10&q=*:*&nr=10"),
                           restrict_xpaths=("//div[@class='paginador']",)),
         callback="parse_item", follow=True),)
def parse_item(self, response):
    ...
I also tried setting "index.php" in the allow parameter of the rule, but that doesn't work either.
I read in the Scrapy groups that I should not put "a/" or "a/@href" in the XPath, because SgmlLinkExtractor searches for the links automatically.
The console output looks fine, but I don't get any items.
Any ideas?
Thanks in advance.
EDIT:
It works with this code:
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from bcncat.items import BcncatItem
import re
class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )
    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        i = BcncatItem()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
The allow parameter for SgmlLinkExtractor is a (list of) regular expression(s). So "?", "*" and "." are treated as special characters.
You can use allow=(re.escape("index.php?pg=search&from=10&q=*:*&nr=10")) (with import re somewhere at the beginning of your script)
EDIT: in fact, the above rule doesn't work. But as you already have the restricted region where you want to extract links, you can use allow=('index.php')
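The effect of the special characters can be checked with the standard re module alone. This sketch only shows how the patterns behave against a plain string. It says nothing about how the link extractor may rewrite a URL before matching, which is a separate question.

```python
import re

# The paginator URL from the question, treated as a plain string.
url = "http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10"
raw = "index.php?pg=search&from=10&q=*:*&nr=10"

# Unescaped: '?' makes the preceding 'p' optional and '*' is a repetition,
# so the pattern no longer describes the literal text.
unescaped_match = re.search(raw, url)

# Escaped: every special character is matched literally.
escaped_match = re.search(re.escape(raw), url)

# The short pattern from the final suggestion.
short_match = re.search("index.php", url)

print(unescaped_match)          # None
print(escaped_match.group())    # index.php?pg=search&from=10&q=*:*&nr=10
print(short_match.group())      # index.php
```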
