I'm pretty new to using Scrapy and I'm having difficulties. I'm trying to work with scrapy to crawl a website and return a list of nodes and edges to build a network graph of internal and external websites from my start page to a depth of x (to be determined).
I have the following code and I'm having trouble figuring out what the issue is.
My items.py file looks like this:
from scrapy.item import Item, Field
class SitegraphItem(Item):
url=Field()
linkedurls=Field()
my graphspider.py file is as follows:
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.utils.url import urljoin_rfc
from sitegraph.items import SitegraphItem
class GraphspiderSpider(CrawlSpider):
name = 'graphspider'
allowed_domains = ['example.com']
start_urls = ['https://www.example.com/products/']
rules = (
Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
)
def parse_item(self, response):
hxs = HtmlXPathSelector(response)
i = SitegraphItem()
i['url'] = response.url
i['http_status'] = response.status
llinks=[]
for anchor in hxs.select('//a[#href]'):
href=anchor.select('#href').extract()[0]
if not href.lower().startswith("javascript"):
llinks.append(urljoin_rfc(response.url,href))
i['linkedurls'] = llinks
return i
and I modified the settings.py file to include:
BOT_NAME = 'sitegraph'
SPIDER_MODULES = ['sitegraph.spiders']
NEWSPIDER_MODULE = 'sitegraph.spiders'
FEED_FORMAT="jsonlines"
FEED_URI="C:\\Users\Merrie\\Desktop\\testscrape\\sitegraph\\sitegraph.json"
When I run it I'm using the following code:
$ scrapy crawl graphspider -o attempt2.csv
And my output table is empty. It also keeps throwing this error: "KeyError: 'SitegraphItem does not support field: http_status'"
Missing http_statusfield in your items.py causes the error, please update it.
from scrapy.item import Item, Field
class SitegraphItem(Item):
url=Field()
linkedurls=Field()
http_status=Field()
Related
I'm using scrapy to scrape URL from a website and save the results in a csv file. But it is saving in one line only instead of multiple line.I tried to search for an answer in stackoverflow but in vain.Here is my file:
import scrapy
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from uomscraprbot.items import UomscraprbotItem
class uomsitelinks(scrapy.Spider):
name = "uom"
allowed_domains = ["uom.ac"]
start_urls = [
"http://www.uom.ac.mu/"]
def parse(self, response):
# print response.xpath('//body//li/a/#href').extract()
item = UomscraprbotItem()
item['url'] = response.xpath('//body//li/a/#href').extract()
return item
i used : scrapy crawl uom -o uom.csv -t csv
i want it to save like this :
www.a.com,
www.b.com,
www.c.com
and not
www.a.com,www.b.com,www.c.com
where did i go wrong in my code?
You need to process each URL separatelly:
def parse(self, response):
# print response.xpath('//body//li/a/#href').extract()
for item_url in response.xpath('//body//li/a/#href').extract():
item = UomscrapebotItem()
item['url'] = item_url
yield item
This is my first question here and I'm learning how to code by myself so please bear with me.
I'm working on a final CS50 project which I'm trying to built a website that aggregates online Spanish course from edx.org and other open online couses websites maybe. I'm using scrapy framework to scrap the filter results of Spanish courses on edx.org... Here is my first scrapy spider which I'm trying to get in each courses link to then get it's name (after I get the code right, also get the description, course url and more stuff).
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader
class Course_item(Item):
name = Field()
#description = Field()
#img_url = Field()
class Course_spider(CrawlSpider):
name = 'CourseSpider'
allowed_domains = ['https://www.edx.org/']
start_urls = ['https://www.edx.org/course/?language=Spanish']
rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)
def parse_item(self, response):
item = ItemLoader(Course_item, response)
item.add_xpath('name', '//*[#id="course-intro-heading"]/text()')
yield item.load_item()
When I run the spider with "scrapy runspider edxSpider.py -o edx.csv -t csv" I get an empty csv file and I also think is not getting into the right spanish courses results.
Basically I want to get in each courses of this link edx Spanish courses and get the name, description, provider, page url and img url.
Any ideas for why might be the problem?
You can't get edx content with a simple request, it uses javascript rendering for getting the course element dynamically, so CrawlSpider won't work on this case, because you need to find specific elements inside the response body to generate a new Request that will get what you need.
The real request (to get the urls of the courses) is this one, but you need to generate it from the previous response body (although you could just visit it an also get the correct data).
So, to generate the real request, you need data that is inside a script tag:
from scrapy import Spider
import re
import json
class Course_spider(Spider):
name = 'CourseSpider'
allowed_domains = ['edx.org']
start_urls = ['https://www.edx.org/course/?language=Spanish']
def parse(self, response):
script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
parseable_json_data = re.search(r'Drupal.settings, ({.+})', script_text).group(1)
json_data = json.loads(parseable_json_data)
...
Now you have what you need on json_data and only need to create the string URL.
This page use JavaScript to get data from server and add to page.
It uses urls like
https://www.edx.org/api/catalog/v2/courses/course-v1:IDBx+IDB33x+3T2017
Last part is course's number which you can find in HTML
<main id="course-info-page" data-course-id="course-v1:IDBx+IDB33x+3T2017">
Code
from scrapy.http import Request
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader
import json
class Course_spider(CrawlSpider):
name = 'CourseSpider'
allowed_domains = ['www.edx.org']
start_urls = ['https://www.edx.org/course/?language=Spanish']
rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)
def parse_item(self, response):
print('parse_item url:', response.url)
course_id = response.xpath('//*[#id="course-info-page"]/#data-course-id').extract_first()
if course_id:
url = 'https://www.edx.org/api/catalog/v2/courses/' + course_id
yield Request(url, callback=self.parse_json)
def parse_json(self, response):
print('parse_json url:', response.url)
item = json.loads(response.body)
return item
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
'USER_AGENT': 'Mozilla/5.0',
'FEED_FORMAT': 'csv', # csv, json, xml
'FEED_URI': 'output.csv', #
})
c.crawl(Course_spider)
c.start()
from scrapy.http import Request
from scrapy import Spider
import json
class edx_scraper(Spider):
name = "edxScraper"
start_urls = [
'https://www.edx.org/api/v1/catalog/search?selected_facets[]=content_type_exact%3Acourserun&selected_facets[]=language_exact%3ASpanish&page=1&page_size=9&partner=edx&hidden=0&content_type[]=courserun&content_type[]=program&featured_course_ids=course-v1%3AHarvardX+CS50B+Business%2Ccourse-v1%3AMicrosoft+DAT206x+1T2018%2Ccourse-v1%3ALinuxFoundationX+LFS171x+3T2017%2Ccourse-v1%3AHarvardX+HDS2825x+1T2018%2Ccourse-v1%3AMITx+6.00.1x+2T2017_2%2Ccourse-v1%3AWageningenX+NUTR101x+1T2018&featured_programs_uuids=452d5bbb-00a4-4cc9-99d7-d7dd43c2bece%2Cbef7201a-6f97-40ad-ad17-d5ea8be1eec8%2C9b729425-b524-4344-baaa-107abdee62c6%2Cfb8c5b14-f8d2-4ae1-a3ec-c7d4d6363e26%2Ca9cbdeb6-5fc0-44ef-97f7-9ed605a149db%2Cf977e7e8-6376-400f-aec6-84dcdb7e9c73'
]
def parse(self, response):
data = json.loads(response.text)
for course in data['objects']['results']:
url = 'https://www.edx.org/api/catalog/v2/courses/' + course['key']
yield response.follow(url, self.course_parse)
if 'next' in data['objects'] is not None:
yield response.follow(data['objects']['next'], self.parse)
def course_parse(self, response):
course = json.loads(response.text)
yield{
'name': course['title'],
'effort': course['effort'],
}
The CrawlSpider I've created is not doing it's job properly. It parses the first page and then stops without going on to the next page. Something I'm doing wrong but can't detect. Hope somebody out there gives me a hint what should I do to rectify it.
"items.py" includes:
from scrapy.item import Item, Field
class CraigslistScraperItem(Item):
Name = Field()
Link = Field()
CrawlSpider names "craigs.py" which contains :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from craigslist_scraper.items import CraigslistScraperItem
class CraigsPySpider(CrawlSpider):
name = "craigs"
allowed_domains = ["craigslist.org"]
start_urls = (
'http://sfbay.craigslist.org/search/npo/',
)
rules=(Rule(LinkExtractor(allow = ('sfbay\.craigslist\.org\/search\/npo/.*',
),restrict_xpaths = ('//a[#class="button next"]')),callback = 'parse',follow = True),)
def parse(self, response):
page=response.xpath('//p[#class="result-info"]')
items=[]
for title in page:
item=CraigslistScraperItem()
item["Name"]=title.xpath('.//a[#class="result-title hdrlnk"]/text()').extract()
item["Link"]=title.xpath('.//a[#class="result-title hdrlnk"]/#href').extract()
items.append(item)
return items
And finally the command I'm using to get CSV output is:
scrapy crawl craigs -o items.csv -t csv
By the way, I tried to use "parse_item" in the first place but found no response that is why I used "parse" method instead. Thanks in advance.
Don't name your callback method parse when you use scrapy.CrawlSpider.
From Scrapy documentation:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
Also, you don't need to append an item to list since you already using Scrapy Items and can simply yield item.
This code should work:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from craigslist_scraper.items import CraigslistScraperItem
class CraigsPySpider(CrawlSpider):
name = "craigs"
allowed_domains = ["craigslist.org"]
start_urls = (
'http://sfbay.craigslist.org/search/npo/',
)
rules = (
Rule(LinkExtractor(allow=('\/search\/npo\?s=.*',)), callback='parse_item', follow=True),
)
def parse_item(self, response):
page = response.xpath('//p[#class="result-info"]')
for title in page:
item = CraigslistScraperItem()
item["Name"] = title.xpath('.//a[#class="result-title hdrlnk"]/text()').extract_first()
item["Link"] = title.xpath('.//a[#class="result-title hdrlnk"]/#href').extract_first()
yield item
Finally for output in csv format run: scrapy crawl craigs -o items.csv
I'm having trouble with Scrapy. I need code that will scrap up to 1000 internal links per given url. My code works when run at command line, but the spider doesn't stop, only receives the message.
My code is as follows:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.contrib.closespider import CloseSpider
class MyItem(Item):
url= Field()
class MySpider(CrawlSpider):
name = 'testspider1'
allowed_domains = ['angieslist.com']
start_urls = ['http://www.angieslist.com']
rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=True), )
def parse_url(self, response):
item = MyItem()
item['url'] = response.url
scrape_count = self.crawler.stats.get_value('item_scraped_count')
print scrape_count
limit = 10
if scrape_count == limit:
raise CloseSpider('Limit Reached')
return item
My problem was trying to apply close spider in the wrong place. It's a variable that needs to be set in the settings.py file. When I set it manually in there, or set it as a argument in the command line, it worked (Stopping within 10-20 of N for what it's worth).
settings.py:
BOT_NAME = 'internal_links'
SPIDER_MODULES = ['internal_links.spiders']
NEWSPIDER_MODULE = 'internal_links.spiders'
CLOSESPIDER_PAGECOUNT = 1000
ITEM_PIPELINES = ['internal_links.pipelines.CsvWriterPipeline']
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'yo mama'
LOG_LEVEL = 'DEBUG'
I am trying to crawl a website using Scrapy, and the urls of every page I want to scrap are all written using a relative path of this kind:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
Link
Now, in my browser, these links work, and you get to urls like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going back up twice in hierarchy instead of once)
But my CrawlSpider does not manage to translate these urls into a "correct" one, and all I get is errors of that kind:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, fairly basic (on the basis of item urls matching "/en/item-*-scrap.html") :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
name = Field()
class siteSpider(CrawlSpider):
name = "domain-name.com"
allowed_domains = ['www.domain-name.com']
start_urls = ["https://www.domain-name.com/en/"]
rules = (
Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=('')), follow=True),
)
def parse_item(self, response):
x = HtmlXPathSelector(response)
product = Product()
product['name'] = ''
name = x.select('//title/text()').extract()
if type(name) is list:
for s in name:
if s != ' ' and s != '':
product['name'] = s
break
return product
Basically deep down, scrapy uses http://docs.python.org/2/library/urlparse.html#urlparse.urljoin for getting the next url by joining currenturl and url link scrapped. And if you join the urls provided you mentioned as example,
<!-- on page https://www.domain-name.com/en/somelist.html -->
Link
the returned url is same as url mentioned in error scrapy error. Try this in python shell.
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html","../../en/item-to-scrap.html")
The urljoin behaviour seems to be valid. See : https://www.rfc-editor.org/rfc/rfc1808.html#section-5.2
If it is possible, can you pass the site, which you are crawling ?
With this understanding, the solutions can be,
Manipulate the urls(remove those two dots and slash). generated in crawl spider. Basically override parse or _request_to_folow.
Source of crawl spider: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
Manipulate the url in the downloadmiddleware, this might be cleaner. You remove the ../ in the process_request of the downloadmiddleware.
Documentation for downloadmiddleware : http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
Use base spider and also return the manipulated url requests you want to crawl further
Documentation for the basespider : http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
Please let me know if you have any questions.
I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
name = Field()
class siteSpider(CrawlSpider):
name = "domain-name.com"
allowed_domains = ['www.domain-name.com']
start_urls = ["https://www.domain-name.com/en/"]
rules = (
Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
)
def parse_item(self, response):
x = HtmlXPathSelector(response)
product = Product()
product['name'] = ''
name = x.select('//title/text()').extract()
if type(name) is list:
for s in name:
if s != ' ' and s != '':
product['name'] = s
break
return product
def process_links(self,links):
for i, w in enumerate(links):
w.url = w.url.replace("../", "")
links[i] = w
return links