I want to use Scrapy (scrapy.org) to scrape a forum. The problem is that the link to every thread looks something like this: http://mywebsite.com/forum/My-Thread-Name-t213.html
If I try to open just http://mywebsite.com/forum/t213.html, it doesn't work: the topic with that ID isn't shown. So I don't know how to generate the thread name and ID of each topic in order to scrape it.
I would really appreciate some help with this one. Thanks in advance!
In the absence of an actual URL to test, I cannot be absolutely sure that this is going to work. Essentially you need to use a regular expression in a CrawlSpider rule that starts with your base URL and matches that plus any string followed by -t, plus any number and then finally .html.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ThreadSpider(CrawlSpider):
    name = "mywebsite"
    allowed_domains = ["mywebsite.com"]
    start_urls = ["http://mywebsite.com/forum"]
    rules = [Rule(SgmlLinkExtractor(allow=('/[^/]+-t\d+\.html')), follow=True,
                  callback='parse_item'),]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        print "We're scraping %s" % response.url
        # do something with the hxs object
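To see what that pattern accepts without running a crawl, here is a small standalone check using only the standard library. It is a sketch that assumes the link extractor applies the pattern as a search over the absolute URL; all URLs other than the one from the question are made up for illustration.

```python
import re

# Same pattern as the rule's `allow` argument, written as a raw string.
THREAD_LINK = re.compile(r'/[^/]+-t\d+\.html')

# Only the first URL comes from the question; the rest are hypothetical.
candidates = [
    "http://mywebsite.com/forum/My-Thread-Name-t213.html",
    "http://mywebsite.com/forum/Another-Topic-t9.html",
    "http://mywebsite.com/forum/index.html",
    "http://mywebsite.com/forum/t213.html",
]

# Keep the URLs that the pattern would let through.
matches = [url for url in candidates if THREAD_LINK.search(url)]
for url in matches:
    print(url)
```

Only the first two URLs are kept, so the spider never has to build thread names itself: it follows whatever thread links the forum pages already contain.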
I'm new to Scrapy and can't get it to do anything. Eventually I want to scrape all the HTML comments from a website by following internal links.
For now I'm just trying to scrape the internal links and add them to a list.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class comment_spider(CrawlSpider):
    name = 'test'
    allowed_domains = ['https://www.andnowuknow.com/']
    start_urls = ["https://www.andnowuknow.com/"]
    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        urls = []
        for link in LinkExtractor(allow=(),).extract_links(response):
            urls.append(link)
        print(urls)
I'm just trying to get it to print something at this point, and nothing I've tried so far works.
It finishes with an exit code of 0 but won't print anything, so I can't tell what's happening.
What am I missing?
Your log messages would surely give us some hints, but I can see that your allowed_domains contains a URL instead of a domain. You should set it like this:
allowed_domains = ["andnowuknow.com"]
(See it in the official documentation)
Hope it helps.
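If you want to derive that value from a start URL instead of typing it by hand, something like the following would do. It is only a sketch: the helper name domain_for_allowed_domains is made up here, and stripping a leading "www." is an assumption about what you want to allow.

```python
from urllib.parse import urlparse

def domain_for_allowed_domains(url):
    """Return the host part of a URL, without scheme, path or a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(domain_for_allowed_domains("https://www.andnowuknow.com/"))
```

This prints andnowuknow.com, which is the form allowed_domains expects.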
I'm using a Scrapy web crawler to extract a bunch of data. As I describe here, I've figured out a brute-force way to get the information I want, but it's really pretty crude: I just enumerate all the pages I want to scrape, which is a few hundred. I need to get this done, so I might just grit my teeth and bear it like a moron, but it would be so much nicer to automate this. How could this process be implemented with link extraction using Scrapy? I've looked at the documentation and made some experiments, as I describe in the question linked above, but nothing has worked yet. This is the brute-force code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from brute_force.items import BruteForceItem

class DmozSpider(BaseSpider):
    name = "brutus"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Html/Zi/21/PWAZAZAZXVILEPWXV.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQCQILEPWB.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQKOILEPWD.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQUYILEPWF.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQMEILEKOCQ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQRNILEKOKO.shtml",
                  "http://tool.httpcn.com/Html/Zi/22/PWCQKOILUYUYKOTBCQ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZAZRNILEPWRN.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQPWILEPWC.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQILILEPWE.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQTBILEKOAZ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQXVILEKOPW.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZPWAZILEKOIL.shtml",
                  "http://tool.httpcn.com/Html/Zi/22/PWCQKOILRNUYKOTBUY.shtml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = BruteForceItem()
        item["the_strokes"] = hxs.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = hxs.xpath('//*[@id="div_a1"]/div[3]').extract()
        items.append(item)
        return items
I think this is what you want:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from brute_force.items import BruteForceItem
from urlparse import urljoin

class DmozSpider(BaseSpider):
    name = "brutus"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ['http://tool.httpcn.com/Zi/BuShou.html']

    def parse(self, response):
        for url in response.css('td a::attr(href)').extract():
            cb = self.parse if '/zi/bushou' in url.lower() else self.parse_item
            yield Request(urljoin(response.url, url), callback=cb)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BruteForceItem()
        item["the_strokes"] = hxs.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = hxs.xpath('//*[@id="div_a1"]/div[3]').extract()
        return item
Try this. The flow is:
1. The spider starts with the start_urls.
2. self.parse finds every a tag inside a td tag. If the URL contains '/zi/bushou', the response goes to self.parse again, because that is what you called the 'second layer'. Otherwise (I think a more specific regex would be better here), the URL is the kind of page you want, and it goes to parse_item.
3. self.parse_item is the function that extracts the information from the final page.
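The routing decision in step 2 can be tried on its own, outside Scrapy. The sketch below uses Python 3's urllib.parse rather than the Python 2 urlparse module from the spider; the route helper and the index-page href are made up for illustration, and only the character-page path comes from the question.

```python
from urllib.parse import urljoin

def route(base_url, href):
    """Return (absolute_url, callback_name) using the same test as the spider."""
    callback = "parse" if "/zi/bushou" in href.lower() else "parse_item"
    return urljoin(base_url, href), callback

base = "http://tool.httpcn.com/Zi/BuShou.html"
hrefs = [
    "/Zi/BuShou/Example.html",               # hypothetical second-layer index page
    "/Html/Zi/21/PWAZAZAZXVILEPWXV.shtml",   # character page from the question
]

for href in hrefs:
    print(route(base, href))
```

The first href is sent back to parse and the second to parse_item, which is the whole idea: index pages are crawled again for more links, and everything else is treated as a final page.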
I want to extract data from http://community.sellfree.co.kr/. Scrapy is working, however it appears to only scrape the start_urls, and doesn't crawl any links.
I would like the spider to crawl the entire site.
The following is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from metacritic.items import MetacriticItem

class MetacriticSpider(BaseSpider):
    name = "metacritic"  # Name of the spider, to be used when crawling
    allowed_domains = ["sellfree.co.kr"]  # Where the spider is allowed to go
    start_urls = [
        "http://community.sellfree.co.kr/"
    ]
    rules = (Rule(SgmlLinkExtractor(allow=('.*',)),
                  callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)  # The XPath selector
        sites = hxs.select('/html/body')
        items = []
        for site in sites:
            item = MetacriticItem()
            item['title'] = site.select('//a[@title]').extract()
            items.append(item)
        return items
There are two kinds of links on the page. One is onclick="location='../bbs/board.php?bo_table=maket_5_3' and another is <span class="list2">solution</span>
How can I get the crawler to follow both kinds of links?
Before I get started, I'd highly recommend using an updated version of Scrapy. It appears you're still using an old one, as many of the methods/classes you're using have been moved around or deprecated.
To the problem at hand: the scrapy.spider.BaseSpider class will not do anything with the rules you specify. Instead, use the scrapy.contrib.spiders.CrawlSpider class, which has the functionality to handle rules built in.
Next, you'll need to rename your parse() method, since CrawlSpider uses parse() internally. (We'll assume parse_page() for the rest of this answer.)
To pick up all basic links and have them crawled, your link extractor will need to be changed. You don't need regular expression syntax such as '.*' to follow everything; an empty allow pattern matches every link. The following will pick them up, and links that are not on the site will be filtered out by the offsite middleware, based on your allowed_domains (the duplicates filter only drops URLs that were already requested):
rules = (
    Rule(SgmlLinkExtractor(allow=('')), callback="parse_page", follow=True),
)
As for the onclick=... links, these are JavaScript links, and the page you are trying to process relies on them heavily. Scrapy cannot crawl things like onclick=location.href="javascript:showLayer_tap('2')" or onclick="win_open('./bbs/profile.php?mb_id=wlsdydahs', because it can't execute showLayer_tap() or win_open() in JavaScript.
(the following is untested, but should work and provide the basic idea of what you need to do)
You can write your own functions for parsing these, though. For instance, the following can handle onclick=location.href="./photo/":
import re

def process_onclick(value):
    # Pull the target URL out of an onclick value such as location.href="./photo/"
    m = re.search("location.href=\"(.*?)\"", value)
    if m:
        return m.group(1)
Then add the following rule (this only handles tables, expand it as needed):
Rule(SgmlLinkExtractor(allow=(''), tags=('table',),
                       attrs=('onclick',), process_value=process_onclick),
     callback="parse_page", follow=True),
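To get a feel for what such a process_value function returns before wiring it into a rule, you can call it directly on a few attribute values. This is a standalone sketch with the dot in the pattern escaped, which is slightly stricter than the version above; the sample onclick strings are taken from the examples in this answer.

```python
import re

def process_onclick(value):
    """Return the URL inside a location.href="..." handler, or None."""
    m = re.search(r'location\.href="(.*?)"', value)
    if m:
        return m.group(1)
    return None

# A plain location.href assignment yields the target URL.
print(process_onclick('location.href="./photo/"'))
# A call to a page-specific function is not understood and yields None.
print(process_onclick("win_open('./bbs/profile.php?mb_id=wlsdydahs')"))
```

Returning None is what tells the link extractor to skip that value, so anything the pattern does not recognise is simply ignored rather than followed.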
I'm writing a spider with Scrapy that works if I don't implement any rules, but now I'm trying to add a Rule that follows the paginator and scrapes all the remaining pages, and I don't know why I can't get it to work.
Spider code:
allowed_domains = ['guia.bcn.cat']
start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
rules = (
    Rule(SgmlLinkExtractor(allow=("index.php?pg=search&from=10&q=*:*&nr=10"),
                           restrict_xpaths=("//div[@class='paginador']",))
         , callback="parse_item", follow=True),)

def parse_item(self, response):
    ...
I also tried setting "index.php" in the allow parameter of the rule, but that doesn't work either.
I read in the Scrapy groups that I don't need to put "a/" or "a/@href" in the XPath, because SgmlLinkExtractor searches for the links automatically.
The console output looks fine, but I don't get any results.
Any ideas?
Thanks in advance.
EDIT:
This code works:
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from bcncat.items import BcncatItem
import re

class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        i = BcncatItem()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
The allow parameter for SgmlLinkExtractor is a (list of) regular expression(s). So "?", "*" and "." are treated as special characters.
You can use allow=(re.escape("index.php?pg=search&from=10&q=*:*&nr=10")) (with import re somewhere at the beginning of your script)
EDIT: in fact, the above rule doesn't work. But as you already have the restricted region where you want to extract links, you can use allow=('index.php')
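The effect of those special characters is easy to check with the re module alone. In this sketch the full URL is an assumption, built by joining the site address from the question with the pattern; the real paginator links may look different, which is why the shorter pattern is the safer choice.

```python
import re

# Assumed example link; the real paginator hrefs may differ.
url = "http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10"
literal = "index.php?pg=search&from=10&q=*:*&nr=10"

# Used as a regex, "?" and "*" are quantifiers, so the literal text is not found.
raw_match = re.search(literal, url)
# Escaped, every character is matched literally.
escaped_match = re.search(re.escape(literal), url)
# The short pattern matches any link that contains index.php.
short_match = re.search("index.php", url)

print(raw_match is None, escaped_match is not None, short_match is not None)
```

All three checks print True: the unescaped string fails to match the very URL it was copied from, while the escaped and the short patterns both match.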
My first question here :)
I was trying to crawl my school's website for all the webpages it has, but I cannot get the links written to a text file. I have the right permissions, so that is not the problem.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider

class hsleidenSpider(CrawlSpider):
    name = "hsleiden1"
    allowed_domains = ["hsleiden.nl"]
    start_urls = ["http://hsleiden.nl"]
    # allow=() is used to match all links
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        filename = "hsleiden-output.txt"
        open(filename, 'ab').write(response.url)
So I am only crawling the hsleiden.nl site, and I would like to write each response.url to the text file hsleiden-output.txt.
Is there a way to do this properly?
With reference to the documentation for CrawlSpider, if multiple rules match the same link then only the first will be used.
Thus, as a result of redirects, using the first rule results in a seemingly infinite loop. Since the second rule is ignored, none of the matching links are ever passed to the parse_item callback, which means no output file.
Some investigation is required to fix the redirect issue (and to modify the first rule so that it doesn't clash with the second), but commenting it out entirely will produce an output file of links like so:
http://www.hsleiden.nl/activiteitenkalenderhttp://www.hsleiden.nlhttp://www.hsleiden.nl/vind-je-studie/proefstuderenhttp://www.hsleiden.nl/studiumgenerale
etc
They were all munged together on a single line, so you might want to add a newline character or separator each time you write to the output file.
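For the newline part, a small helper along these lines would write one URL per line. It is a sketch for Python 3: the append_url name is made up, and the temporary directory is only there so the example can run anywhere; in the spider you would pass "hsleiden-output.txt" and response.url instead.

```python
import os
import tempfile

def append_url(filename, url):
    """Append one URL per line to a text file, closing the file each time."""
    with open(filename, "a", encoding="utf-8") as f:
        f.write(url + "\n")

# Write to a throwaway location for the demonstration.
output_path = os.path.join(tempfile.mkdtemp(), "hsleiden-output.txt")
for u in ["http://www.hsleiden.nl/activiteitenkalender",
          "http://www.hsleiden.nl",
          "http://www.hsleiden.nl/vind-je-studie/proefstuderen"]:
    append_url(output_path, u)

with open(output_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Using a with block also makes sure the file is flushed and closed after every write, instead of relying on the interpreter to clean up the open handle.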