I'm trying to scrap a website that has a uncommon web-page structure, page upon page upon page until i get to the item i'm trying to extract data from,
edit(Thanks to the answers, I have been able to extract most data I require, however I need the path links to get to the said product)
Here's the code I have so far:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
name = 'drapertools.com'
start_urls = ['https://www.drapertools.com/category/0/Product%20Range']
rules = (
Rule(LinkExtractor(allow=['/category-?.*?/'])),
Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
)
def parse_product(self, response):
yield {
'product_name': response.xpath('//div[#id="product-title"]//h1[#class="text-primary"]/text()').extract_first(),
'product_number': response.xpath('//div[#id="product-title"]//h1[#style="margin-bottom: 20px; color:#000000; font-size: 23px;"]/text()').extract_first(),
'product_price': response.xpath('//div[#id="product-title"]//p/text()').extract_first(),
'product_desc': response.xpath('//div[#class="col-md-6 col-sm-6 col-xs-12 pull-left"]//div[#class="col-md-11 col-sm-11 col-xs-11"]//p/text()').extract_first(),
'product_path': response.xpath('//div[#class="nav-container"]//ol[#class="breadcrumb"]//li//a/text()').extract(),
'product_path_links': response.xpath('//div[#class="nav-container"]//ol[#class="breadcrumb"]//li//a/href()').extract(),
}
I don't know if this would work or anything, can anyone please help me here?
I would greatly appreciate it.
More Info:
I'm trying to access all categories and all items within them
however there is a categories within them and even more before I can get to the item.
I'm thinking of using Guillaume's LinkExtractor Code but i'm not sure that is supposed to be used for the outcome I want...
rules = (
Rule(LinkExtractor(allow=['/category-?.*?/'])),
Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
)
Why not using a CrawlSpider instead! It's perfect for this use-case!
It basically gets all the links for every page recursively, and calls a callback only for the interesting ones (I'm assuming that you are interested in the products).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
name = 'drapertools.com'
start_urls = ['https://www.drapertools.com/category/0/Product%20Range']
rules = (
Rule(LinkExtractor(allow=['/category-?.*?/'])),
Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
)
def parse_product(self, response):
yield {
'product_name': response.xpath('//div[#id="product-title"]//h1[#class="text-primary"]/text()').extract_first(),
}
You have the same structure for all pages, maybe you can shorten it?
import scrapy
class DraperToolsSpider(scrapy.Spider):
name = 'drapertools_spider'
start_urls = ["https://www.drapertools.com/category/0/Product%20Range"]
def parse(self, response):
# this will call self.parse by default for all your categories
for url in response.css('.category p a::attr(href)').extract():
yield scrapy.Request(response.urljoin(url))
# here you can add some "if" if you want to catch details only on certain pages
for req in self.parse_details(response):
yield req
def parse_details(self, response):
yield {}
Related
Introduction
currently im working on a crawler, which saves every link of a domain to a .csv-file
Problem
In my console, i can see, which links its following, but my items are still empty.
I get something like:
Here is my default code
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import LinkextractorItem
class TopArtSpider(CrawlSpider):
name = "topart"
start_urls = [
'https://www.topart-online.com/de/Bambus-Kunstbaeume/l-KAT11'
]
custom_settings = {'FEED_EXPORT_FIELDS' : ['Link'] }
rules = (
Rule(LinkExtractor(), callback='parse_item', follow=True),
)
def parse_item(self, response):
items = LinkextractorItem()
link = response.xpath('a/#href')
items['Link'] = link
yield items
my start_url is just a category of the domain, because i dont want to wait too long, as long as im trying to build the correct spider.
The XPATH selector isn't searching th entire DOM. Change it to this.
link = response.xpath('//a/#href')
The // searches the entire DOM.
You also are not grabbing the data, so you need to include getall() which will give you a list. You could also use a for loop, to loop around each link which I think is probably the approach you should do.
link = response.xpath('//a/#href')
for a in link:
items['Link'] = a.get()
yield items
I'm new to scrapy and cant get it to do anything. Eventually I want to scrape all the html comments from a website by following internal links.
For now I'm just trying to scrape the internal links and add them to a list.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class comment_spider(CrawlSpider):
name = 'test'
allowed_domains = ['https://www.andnowuknow.com/']
start_urls = ["https://www.andnowuknow.com/"]
rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)
def parse_start_url(self, response):
return self.parse_item(response)
def parse_item(self, response):
urls = []
for link in LinkExtractor(allow=(),).extract_links(response):
urls.append(link)
print(urls)
I'm just trying get it to print something at this point, nothing I've tried so far works.
It finishes with an exit code of 0, but won't print so I cant tell whats happening.
What am I missing?
Surely your messages log should give us some hints, but I see your allowed_domains has a URL instead of a domain. You should set it like this:
allowed_domains = ["andnowuknow.com"]
(See it in the official documentation)
Hope it helps.
This is my first question here and I'm learning how to code by myself so please bear with me.
I'm working on a final CS50 project which I'm trying to built a website that aggregates online Spanish course from edx.org and other open online couses websites maybe. I'm using scrapy framework to scrap the filter results of Spanish courses on edx.org... Here is my first scrapy spider which I'm trying to get in each courses link to then get it's name (after I get the code right, also get the description, course url and more stuff).
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader
class Course_item(Item):
name = Field()
#description = Field()
#img_url = Field()
class Course_spider(CrawlSpider):
name = 'CourseSpider'
allowed_domains = ['https://www.edx.org/']
start_urls = ['https://www.edx.org/course/?language=Spanish']
rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)
def parse_item(self, response):
item = ItemLoader(Course_item, response)
item.add_xpath('name', '//*[#id="course-intro-heading"]/text()')
yield item.load_item()
When I run the spider with "scrapy runspider edxSpider.py -o edx.csv -t csv" I get an empty csv file and I also think is not getting into the right spanish courses results.
Basically I want to get in each courses of this link edx Spanish courses and get the name, description, provider, page url and img url.
Any ideas for why might be the problem?
You can't get edx content with a simple request, it uses javascript rendering for getting the course element dynamically, so CrawlSpider won't work on this case, because you need to find specific elements inside the response body to generate a new Request that will get what you need.
The real request (to get the urls of the courses) is this one, but you need to generate it from the previous response body (although you could just visit it an also get the correct data).
So, to generate the real request, you need data that is inside a script tag:
from scrapy import Spider
import re
import json
class Course_spider(Spider):
name = 'CourseSpider'
allowed_domains = ['edx.org']
start_urls = ['https://www.edx.org/course/?language=Spanish']
def parse(self, response):
script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
parseable_json_data = re.search(r'Drupal.settings, ({.+})', script_text).group(1)
json_data = json.loads(parseable_json_data)
...
Now you have what you need on json_data and only need to create the string URL.
This page use JavaScript to get data from server and add to page.
It uses urls like
https://www.edx.org/api/catalog/v2/courses/course-v1:IDBx+IDB33x+3T2017
Last part is course's number which you can find in HTML
<main id="course-info-page" data-course-id="course-v1:IDBx+IDB33x+3T2017">
Code
from scrapy.http import Request
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader
import json
class Course_spider(CrawlSpider):
name = 'CourseSpider'
allowed_domains = ['www.edx.org']
start_urls = ['https://www.edx.org/course/?language=Spanish']
rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)
def parse_item(self, response):
print('parse_item url:', response.url)
course_id = response.xpath('//*[#id="course-info-page"]/#data-course-id').extract_first()
if course_id:
url = 'https://www.edx.org/api/catalog/v2/courses/' + course_id
yield Request(url, callback=self.parse_json)
def parse_json(self, response):
print('parse_json url:', response.url)
item = json.loads(response.body)
return item
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
'USER_AGENT': 'Mozilla/5.0',
'FEED_FORMAT': 'csv', # csv, json, xml
'FEED_URI': 'output.csv', #
})
c.crawl(Course_spider)
c.start()
from scrapy.http import Request
from scrapy import Spider
import json
class edx_scraper(Spider):
name = "edxScraper"
start_urls = [
'https://www.edx.org/api/v1/catalog/search?selected_facets[]=content_type_exact%3Acourserun&selected_facets[]=language_exact%3ASpanish&page=1&page_size=9&partner=edx&hidden=0&content_type[]=courserun&content_type[]=program&featured_course_ids=course-v1%3AHarvardX+CS50B+Business%2Ccourse-v1%3AMicrosoft+DAT206x+1T2018%2Ccourse-v1%3ALinuxFoundationX+LFS171x+3T2017%2Ccourse-v1%3AHarvardX+HDS2825x+1T2018%2Ccourse-v1%3AMITx+6.00.1x+2T2017_2%2Ccourse-v1%3AWageningenX+NUTR101x+1T2018&featured_programs_uuids=452d5bbb-00a4-4cc9-99d7-d7dd43c2bece%2Cbef7201a-6f97-40ad-ad17-d5ea8be1eec8%2C9b729425-b524-4344-baaa-107abdee62c6%2Cfb8c5b14-f8d2-4ae1-a3ec-c7d4d6363e26%2Ca9cbdeb6-5fc0-44ef-97f7-9ed605a149db%2Cf977e7e8-6376-400f-aec6-84dcdb7e9c73'
]
def parse(self, response):
data = json.loads(response.text)
for course in data['objects']['results']:
url = 'https://www.edx.org/api/catalog/v2/courses/' + course['key']
yield response.follow(url, self.course_parse)
if 'next' in data['objects'] is not None:
yield response.follow(data['objects']['next'], self.parse)
def course_parse(self, response):
course = json.loads(response.text)
yield{
'name': course['title'],
'effort': course['effort'],
}
Here is the code I'm using for scraping all the urls of a domain:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
class UrlsSpider(scrapy.Spider):
name = 'urlsspider'
allowed_domains = ['example.com']
start_urls = ['http://example.com/']
rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True))
def parse(self, response):
for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
print link.url
yield scrapy.Request(link.url, callback=self.parse)
As you can see that I've used unique=True but it's still printing duplicate urls in the terminal whereas I want only the unique urls and not duplicate urls.
Any help on this matter will be very helpful.
Since the code looks at the content of the URLs recursively, you will see the duplicate URLs from the parsing of other pages. You essentially have multiple instances of LxmlLinkExtractor().
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class DmozSpider(BaseSpider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul/li')
for site in sites:
title = site.select('a/text()').extract()
link = site.select('a/#href').extract()
desc = site.select('text()').extract()
print title, link, desc
This is my code. I want plenty of URLs to scrape using loop. So how am I suposed to these? I did put multiple urls in there but I didn't get output from all of them. Some URLs stop responding. So how can I get the data for sure using this code?
You code looks ok but are you sure that start_urls shouldn't start with http://
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
UPD
start_urls is a list of urls scrapy starts with. Usually it has one or two links. Rarely more.
This pages must have identical HTML structure because Scrapy spider process them the same way.
See if i put 4-5 url's in start_urls it gives output ok for first 2-3
url's.
I don't believe this because scrapy doesn't care how many links is start_urls list.
But it stops responding and also tell me how i can implement GUI for this.?
Scrapy has debug shell to test you code.
You just posted the code from the tutorial. What you should do is to actually read the whole documentation, especially the basic concept part. What you basically want is the crawl spider where you can define rules that the spider will follow and process with your given code.
To quote the docs with the example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
class MySpider(CrawlSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com']
rules = (
# Extract links matching 'category.php' (but not matching 'subsection.php')
# and follow links from them (since no callback means follow=True by default).
Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
# Extract links matching 'item.php' and parse them with the spider's method parse_item
Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
)
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
hxs = HtmlXPathSelector(response)
item = Item()
item['id'] = hxs.select('//td[#id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = hxs.select('//td[#id="item_name"]/text()').extract()
item['description'] = hxs.select('//td[#id="item_description"]/text()').extract()
return item