This is the code for Spyder1 that I've been trying to write within Scrapy framework:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from firm.items import FirmItem
class Spider1(CrawlSpider):
domain_name = 'wc2'
start_urls = ['http://www.whitecase.com/Attorneys/List.aspx?LastName=A']
rules = (
Rule(SgmlLinkExtractor(allow=["hxs.select(
'//td[#class='altRow'][1]/a/#href').re('/.a\w+')"]),
callback='parse'),
)
def parse(self, response):
hxs = HtmlXPathSelector(response)
JD = FirmItem()
JD['school'] = hxs.select(
'//td[#class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)'
)
return JD
SPIDER = Spider1()
The regex in the rules successfully pulls all the bio urls that I want from the start url:
>>> hxs.select(
... '//td[#class="altRow"][1]/a/#href').re('/.a\w+')
[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto', u
'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic', u'
/kallchurch', u'/jalleyne', u'/lalonzo', u'/malthoff', u'/valvarez', u'/camon',
u'/randerson', u'/eandreeva', u'/pangeli', u'/jangland', u'/mantczak', u'/darany
i', u'/carhold', u'/marora', u'/garrington', u'/jartzinger', u'/sasayama', u'/ma
sschenfeldt', u'/dattanasio', u'/watterbury', u'/jaudrlicka', u'/caverch', u'/fa
yanruoh', u'/razar']
>>>
But when I run the code I get
[wc2] ERROR: Error processing FirmItem(school=[]) -
[Failure instance: Traceback: <type 'exceptions.IndexError'>: list index out of range
This is the FirmItem in Items.py
from scrapy.item import Item, Field
class FirmItem(Item):
school = Field()
pass
Can you help me understand where the index error occurs?
It seems to me that it has something to do with SgmLinkExtractor.
I've been trying to make this spider work for weeks with Scrapy. They have an excellent tutorial but I am new to python and web programming so I don't understand how for instance SgmlLinkExtractor works behind the scene.
Would it be easier for me to try to write a spider with the same simple functionality with Python libraries? I would appreciate any comments and help.
Thanks
SgmlLinkExtractor doesn't support selectors in its "allow" argument.
So this is wrong:
SgmlLinkExtractor(allow=["hxs.select('//td[#class='altRow'] ...')"])
This is right:
SgmlLinkExtractor(allow=[r"product\.php"])
The parse function is called for each match of your SgmlLinkExtractor.
As Pablo mentioned you want to simplify your SgmlLinkExtractor.
I also tried to put the names scraped from the initial url into a list and then pass each name to parse in the form of absolute url as http://www.whitecase.com/aabbas (for /aabbas).
The following code loops over the list, but I don't know how to pass this to parse . Do you think this is a better idea?
baseurl = 'http://www.whitecase.com'
names = ['aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']
def makeurl(baseurl, names):
for x in names:
url = baseurl + x
baseurl = 'http://www.whitecase.com'
x = ''
return url
Related
I'm SEO specialist, not really into coding.. But want to try to create a broken links checker in Python with Scrapy module, which will crawl my website and will show me all internal links with 404 code..
So far I have managed to write this code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import Broken
class Spider(CrawlSpider):
name = 'example'
handle_httpstatus_list = [404]
allowed_domains = ['www.example.com']
start_urls = ['https://www.example.com']
rules = [Rule(LinkExtractor(), callback='parse_info', follow=True)]
def parse_info(self, response):
report = [404]
if response.status in report:
Broken_URLs = Broken()
#Broken_URLs['title']= response.xpath('/html/head/title').get()
Broken_URLs['referer'] = response.request.headers.get('Referer', None)
Broken_URLs['status_code']= response.status
Broken_URLs['url']= response.url
Broken_URLs['anchor']= response.meta.get('link_text')
return Broken_URLs
It's crawling well, as long as we have absolute url's in the site structure.
But there some cases when the crawler comes across with relative url's and end up with this kind of links:
Normally should be:
https://www.example.com/en/...
But it gives me:
https://www.example.com/en/en/... - double language folder, which end up with 404 code.
I'm trying to find a way to override this language duplication, with correct structure at the end.
Does somebody know the way how to fix it? Will much appreciate it!
Scrapy use urllib.parse.urljoin for working with relative urls.
You can fix it by adding custom function into process_request in Rule definition:
def fix_urls():
def process_request(request, response):
return request.replace(url=request.url.replace("/en/en/", "/en/"))
return process_request
class Spider(CrawlSpider):
name = 'example'
...
rules = [Rule(LinkExtractor(), process_request=fix_urls(), callback='parse_info', follow=True)]
The CrawlSpider I've created is not doing it's job properly. It parses the first page and then stops without going on to the next page. Something I'm doing wrong but can't detect. Hope somebody out there gives me a hint what should I do to rectify it.
"items.py" includes:
from scrapy.item import Item, Field
class CraigslistScraperItem(Item):
Name = Field()
Link = Field()
CrawlSpider names "craigs.py" which contains :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from craigslist_scraper.items import CraigslistScraperItem
class CraigsPySpider(CrawlSpider):
name = "craigs"
allowed_domains = ["craigslist.org"]
start_urls = (
'http://sfbay.craigslist.org/search/npo/',
)
rules=(Rule(LinkExtractor(allow = ('sfbay\.craigslist\.org\/search\/npo/.*',
),restrict_xpaths = ('//a[#class="button next"]')),callback = 'parse',follow = True),)
def parse(self, response):
page=response.xpath('//p[#class="result-info"]')
items=[]
for title in page:
item=CraigslistScraperItem()
item["Name"]=title.xpath('.//a[#class="result-title hdrlnk"]/text()').extract()
item["Link"]=title.xpath('.//a[#class="result-title hdrlnk"]/#href').extract()
items.append(item)
return items
And finally the command I'm using to get CSV output is:
scrapy crawl craigs -o items.csv -t csv
By the way, I tried to use "parse_item" in the first place but found no response that is why I used "parse" method instead. Thanks in advance.
Don't name your callback method parse when you use scrapy.CrawlSpider.
From Scrapy documentation:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
Also, you don't need to append an item to list since you already using Scrapy Items and can simply yield item.
This code should work:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from craigslist_scraper.items import CraigslistScraperItem
class CraigsPySpider(CrawlSpider):
name = "craigs"
allowed_domains = ["craigslist.org"]
start_urls = (
'http://sfbay.craigslist.org/search/npo/',
)
rules = (
Rule(LinkExtractor(allow=('\/search\/npo\?s=.*',)), callback='parse_item', follow=True),
)
def parse_item(self, response):
page = response.xpath('//p[#class="result-info"]')
for title in page:
item = CraigslistScraperItem()
item["Name"] = title.xpath('.//a[#class="result-title hdrlnk"]/text()').extract_first()
item["Link"] = title.xpath('.//a[#class="result-title hdrlnk"]/#href').extract_first()
yield item
Finally for output in csv format run: scrapy crawl craigs -o items.csv
I am a beginner with python and using scrapy to extract links from the following webpage
http://www.basketball-reference.com/leagues/NBA_2015_games.html.
The code that I have written is
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem
class BasketballSpider(CrawlSpider):
name = 'basketball'
allowed_domains = ['basketball-reference.com/']
start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]
def parse_item(self, response):
item = BasketballItem()
item['url'] = response.url
return item
I run this code through the command prompt, but the file created does not have any links. Could someone please help?
It cannot find the links, fix you regular expression in the rule:
rules = [
Rule(LinkExtractor(allow='boxscores/\w+'))
]
Also, you don't have to set the callback when it is called parse_item - it is a default.
And allow can be set as a string also.
rules = [
Rule(LinkExtractor(allow='boxscores/\w+'), callback='parse_item')
]
I am trying to crawl a website using Scrapy, and the urls of every page I want to scrap are all written using a relative path of this kind:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
Link
Now, in my browser, these links work, and you get to urls like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going back up twice in hierarchy instead of once)
But my CrawlSpider does not manage to translate these urls into a "correct" one, and all I get is errors of that kind:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, fairly basic (on the basis of item urls matching "/en/item-*-scrap.html") :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
name = Field()
class siteSpider(CrawlSpider):
name = "domain-name.com"
allowed_domains = ['www.domain-name.com']
start_urls = ["https://www.domain-name.com/en/"]
rules = (
Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=('')), follow=True),
)
def parse_item(self, response):
x = HtmlXPathSelector(response)
product = Product()
product['name'] = ''
name = x.select('//title/text()').extract()
if type(name) is list:
for s in name:
if s != ' ' and s != '':
product['name'] = s
break
return product
Basically deep down, scrapy uses http://docs.python.org/2/library/urlparse.html#urlparse.urljoin for getting the next url by joining currenturl and url link scrapped. And if you join the urls provided you mentioned as example,
<!-- on page https://www.domain-name.com/en/somelist.html -->
Link
the returned url is same as url mentioned in error scrapy error. Try this in python shell.
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html","../../en/item-to-scrap.html")
The urljoin behaviour seems to be valid. See : https://www.rfc-editor.org/rfc/rfc1808.html#section-5.2
If it is possible, can you pass the site, which you are crawling ?
With this understanding, the solutions can be,
Manipulate the urls(remove those two dots and slash). generated in crawl spider. Basically override parse or _request_to_folow.
Source of crawl spider: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
Manipulate the url in the downloadmiddleware, this might be cleaner. You remove the ../ in the process_request of the downloadmiddleware.
Documentation for downloadmiddleware : http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
Use base spider and also return the manipulated url requests you want to crawl further
Documentation for the basespider : http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
Please let me know if you have any questions.
I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
name = Field()
class siteSpider(CrawlSpider):
name = "domain-name.com"
allowed_domains = ['www.domain-name.com']
start_urls = ["https://www.domain-name.com/en/"]
rules = (
Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
)
def parse_item(self, response):
x = HtmlXPathSelector(response)
product = Product()
product['name'] = ''
name = x.select('//title/text()').extract()
if type(name) is list:
for s in name:
if s != ' ' and s != '':
product['name'] = s
break
return product
def process_links(self,links):
for i, w in enumerate(links):
w.url = w.url.replace("../", "")
links[i] = w
return links
So basically I want to use Scrapy.org in order to scrape a forum. The problem I encounter is that the link to every thread are somewhat along this line http://mywebsite.com/forum/My-Thread-Name-t213.html
Now, if I try to enter just http://mywebsite.com/forum/t213.html it doesn't work, it doesn't show the topic with that ID so I don't really know how I could generate the thread name and the id of each topic in order to be able to scrape it.
I would really appreciate some help with this one, thanks in advance !
In the absence of an actual URL to test, I cannot be absolutely sure that this is going to work. Essentially you need to use a regular expression in a CrawlSpider rule that starts with your base URL and matches that plus any string followed by -t, plus any number and then finally .html.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
class ThreadSpider(CrawlSpider):
name = "mywebsite"
allowed_domains = ["mywebsite.com"]
start_urls = ["http://mywebsite.com/forum"]
rules = [Rule(SgmlLinkExtractor(allow = ('/[^/]+-t\d+\.html')), follow=True,
callback='parse_item'),]
def parse_item(self, response):
hxs = HtmlXPathSelector(response)
print "We're scraping %s" % response.url
# do something with the hxs object