As part of learning Scrapy, I have tried to crawl Amazon, and there is a problem while scraping data.
The output of my code is as follows:
2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
{'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13',
u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate/dp/0062065351/ref=sr_1_14?s=books&ie=UTF8&qid=1361774694&sr=1-14',
u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15',
u'http://www.amazon.com/Inferno-Robert-Langdon-Dan-Brown/dp/0385537859/ref=sr_1_16?s=books&ie=UTF8&qid=1361774694&sr=1-16',
u'http://www.amazon.com/Memory-Light-Wheel-Time/dp/0765325950/ref=sr_1_17?s=books&ie=UTF8&qid=1361774694&sr=1-17',
u'http://www.amazon.com/Jesus-Calling-Enjoying-Peace-Presence/dp/1591451884/ref=sr_1_18?s=books&ie=UTF8&qid=1361774694&sr=1-18',
u'http://www.amazon.com/Fifty-Shades-Grey-Book-Trilogy/dp/0345803485/ref=sr_1_19?s=books&ie=UTF8&qid=1361774694&sr=1-19',
u'http://www.amazon.com/Fifty-Shades-Trilogy-Darker-3-/dp/034580404X/ref=sr_1_20?s=books&ie=UTF8&qid=1361774694&sr=1-20',
u'http://www.amazon.com/Wheat-Belly-Lose-Weight-Health/dp/1609611543/ref=sr_1_21?s=books&ie=UTF8&qid=1361774694&sr=1-21',
u'http://www.amazon.com/Publication-Manual-American-Psychological-Association/dp/1433805618/ref=sr_1_22?s=books&ie=UTF8&qid=1361774694&sr=1-22',
u'http://www.amazon.com/One-Only-Ivan-Katherine-Applegate/dp/0061992259/ref=sr_1_23?s=books&ie=UTF8&qid=1361774694&sr=1-23',
u'http://www.amazon.com/Inquebrantable-Spanish-Jenni-Rivera/dp/1476745420/ref=sr_1_24?s=books&ie=UTF8&qid=1361774694&sr=1-24'],
'title': [u'ObamaCare Survival Guide',
u'The Official SAT Study Guide, 2nd edition',
u'Inferno: A Novel (Robert Langdon)',
u'A Memory of Light (Wheel of Time)',
u'Jesus Calling: Enjoying Peace in His Presence',
u'Fifty Shades of Grey: Book One of the Fifty Shades Trilogy',
u'Fifty Shades Trilogy: Fifty Shades of Grey, Fifty Shades Darker, Fifty Shades Freed 3-volume Boxed Set',
u'Wheat Belly: Lose the Wheat, Lose the Weight, and Find Your Path Back to Health',
u'Publication Manual of the American Psychological Association, 6th Edition',
u'The One and Only Ivan',
u'Inquebrantable (Spanish Edition)'],
'visit_id': '2f4d045a9d6013ef4a7cbc6ed62dc111f6111633',
'visit_status': 'new'}
But I wanted the output to be captured like this:
2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
{'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13'],
'title': [u'ObamaCare Survival Guide']}
2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
{'link': [u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15'],
'title': [u'The Official SAT Study Guide, 2nd edition']}
I think it's not a problem with Scrapy or the crawler, but with the for loop I've written.
Following is the code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Amaze.items import AmazeItem

class AmazeSpider2(CrawlSpider):
    name = "scanon"
    allowed_domains = ["www.amazon.com"]
    start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*")), callback="parse_items_1", follow=True),
    )

    def parse_items_1(self, response):
        items = []
        print('*** response:', response.url)
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//h3')
        for title in titles:
            item = AmazeItem()
            item["title"] = title.select('//a[@class="title"]/text()').extract()
            item["link"] = title.select('//a[@class="title"]/@href').extract()
            print('**parse-items_1:', item["title"], item["link"])
            items.append(item)
        return items
Any assistance would be appreciated!
The problem is in your XPath:
def parse_items_1(self, response):
    items = []
    print('*** response:', response.url)
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')
    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()
        print('**parse-items_1:', item["title"], item["link"])
        items.append(item)
    return items
In the above XPaths you need the leading . so the expression searches within the current title node only; otherwise the XPath is evaluated against the whole page, gets a lot of matches, and returns all of them.
By the way, you can test out your XPath expressions in the Scrapy shell: http://doc.scrapy.org/en/latest/topics/shell.html
Done right, it will save you hours of work and a headache. :)
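For example, a minimal session might look like this (the URL is from the question; in the Scrapy versions of that era the shell exposes the page as hxs):

scrapy shell "http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155"
>>> len(hxs.select('//h3'))                                   # how many h3 blocks are on the page
>>> hxs.select('//h3')[0].select('.//a[@class="title"]/text()').extract()   # title of the first block only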
Use yield to make a generator and fix your XPath selectors:
def parse_items_1(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')
    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()
        yield item
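For reference, newer Scrapy versions deprecate HtmlXPathSelector in favour of selectors built into the response object; a sketch of the same callback under that API, assuming the same page structure:

def parse_items_1(self, response):
    # response.xpath() replaces HtmlXPathSelector(response).select()
    for title in response.xpath('//h3'):
        item = AmazeItem()
        # the leading dot keeps the search relative to this <h3> node
        item["title"] = title.xpath('.//a[@class="title"]/text()').extract()
        item["link"] = title.xpath('.//a[@class="title"]/@href').extract()
        yield item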
I'm trying to learn the basics of Scrapy. I've written the below spider to scrape one of the practice websites, books.toscrape.com. The spider scrapes the site, and when I just tell it to print title and price it returns them for every book on the site; but when I use yield, as below, it only returns the information for the last book listed on the site.
I've no doubt my mistake's really simple, but I can't work out what it is.
Can anyone tell me why this only scrapes the final title and price listing on the site?
Thanks!
import scrapy

class FirstSpider(scrapy.Spider):
    name = "CW"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        books = response.xpath('//article[@class="product_pod"]')
        for item in books:
            title = item.xpath('.//h3/a/@title').getall()
            price = item.xpath('.//div/p[@class="price_color"]').getall()
        yield {
            'title': title,
            'price': price,
        }
You misindented the yield:
Fixed:
import scrapy

class FirstSpider(scrapy.Spider):
    name = "CW"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        books = response.xpath('//article[@class="product_pod"]')
        for item in books:
            title = item.xpath('.//h3/a/@title').getall()
            price = item.xpath('.//div/p[@class="price_color"]').getall()
            yield {
                'title': title,
                'price': price,
            }
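Note also that getall() always returns a list, which is why each value in the output is wrapped in brackets. If you want a single string per book, get() returns just the first match; a small sketch (adding /text() so the price comes back as text rather than the whole <p> element):

title = item.xpath('.//h3/a/@title').get()                        # one string, or None if nothing matches
price = item.xpath('.//div/p[@class="price_color"]/text()').get()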
I'm doing the Scrapy tutorial and am on the 'Craigslist Scrapy Spider #3 – Multiple Pages' section, but I am unable to get more than one page after following the instructions given. The only difference between what I did and what the tutorial showed is that I used all jobs rather than only engineering jobs (since there was only one page of engineering jobs). Below is the code I have:
import scrapy
from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = 'jobs-new'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/jjj']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')
        for job in jobs:
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)
            yield {'URL': absolute_url, 'Title': title, 'Address': address}
        relative_next_url = response.xpat('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = response.urljoin(relative_next_url)
        yield request(absolute_next_url, callback=self.parse)
I ran this in a terminal using
scrapy crawl jobs-new -o jobs-new.csv
but there was only the first page of results within the .csv file.
What do I need to do to get more than one page? Is the tutorial incorrect or did I understand it incorrectly?
I just edited your code and got it working: you had a typo in response.xpath, and the next-page request must be scrapy.Request.
import scrapy
from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = 'jobs-new'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/jjj']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')
        for job in jobs:
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)
            yield {'URL': absolute_url, 'Title': title, 'Address': address}
        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = response.urljoin(relative_next_url)
        yield scrapy.Request(absolute_next_url, callback=self.parse)
Here is some output:
{'URL': 'https://newyork.craigslist.org/brk/trp/d/brooklyn-overnight-parking-attendant/7166876233.html', 'Title': 'Overnight Parking Attendant', 'Address': 'Brooklyn, NY'}
{'URL': 'https://newyork.craigslist.org/wch/fbh/d/yonkers-experience-grill-man/7166875818.html', 'Title': 'Experience grill man', 'Address': 'Yonkers'}
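For what it's worth, since Scrapy 1.4 the same next-page hop can be written with response.follow, which joins relative URLs itself; a sketch of the tail of parse under the same page structure, with a guard so the spider stops cleanly on the last page:

next_page = response.xpath('//a[@class="button next"]/@href').extract_first()
if next_page:
    yield response.follow(next_page, callback=self.parse)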
Web scraping a website, I'm making a spider that crawls through the pagination for the newest videos, scraping metadata for each of the 32 videos per page.
Next is my code for the spider:
import scrapy

class NaughtySpider(scrapy.Spider):
    name = "naughtyspider"
    allowed_domains = ["example.com"]
    max_pages = 3

    # Start request
    def start_requests(self):
        for i in range(1, self.max_pages):
            yield scrapy.Request('https://www.example.com/video?o=cm&page=%s' % i, callback=self.parse_video)

    # First parsing method
    def parse_video(self, response):
        self.log('Finished scraping ' + response.url)
        # Correct path: selects the 32 videos on the page, ignoring links coming from ads
        video_links = response.css('ul#videoCategory').css('li.videoBox').css('div.thumbnail-info-wrapper').css('span.title > a').css('::attr(href)')
        links_to_follow = video_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_metadata)

    # Second parsing method
    def parse_metadata(self, response):
        # Create a SelectorList of the course titles text
        video_title = response.css('div.title-container > h1.title > span.inlineFree::text')
        # Extract the text and strip it clean
        video_title_ext = video_title.extract_first()
        # Extract views
        video_views = response.css('span.count::text').extract_first()
        # Extract tags
        video_tags = response.css('div.tagsWrapper a::text').extract()
        # Extract Categories
        video_categories = response.css('div.categoriesWrapper a::text').extract()
        # Fill in the dictionary
        yield {
            'title': video_title_ext,
            'views': video_views,
            'tags': video_tags,
            'categories': video_categories,
        }
The thing is that almost half the entries end up being empty, with no title, views, tags or categories. Example of the log:
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/view_video.php?viewkey=ph5d594b093f8d6>
{'title': None, 'views': None, 'tags': [], 'categories': []}
But at the same time, if I fetch the very same link in the scrapy shell, and copy and paste the very same selector paths in the spider, it gives me the correct values:
In [4]: fetch('https://www.example.com/view_video.php?viewkey=ph5d594b093f8d6')
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/view_video.php?viewkey=ph5d594b093f8d6> (referer: None)
In [5]: response.css('div.tagsWrapper a::text').extract()
Out[5]: ['alday', '559', '+ ']
In [6]: response.css('span.count::text').extract_first()
Out[6]: '6'
Thanks in advance for any help.
Edit: Would I be correct in thinking that this is not a problem with my code but rather a limitation on the server to avoid being scraped?
The data such as the views, duration, etc. seem to be pulled in through an HTML <var> element (<var> DATA </var>). For example, if you enter the following line in your Scrapy shell, you should get the duration:
response.xpath(".//var[@class='duration']")
Not sure if it will work, but it is worth a try.
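If the shell and the running spider really do see different HTML, one standard way to confirm it is Scrapy's inspect_response helper, which drops you into a shell bound to the exact response the spider received:

from scrapy.shell import inspect_response

def parse_metadata(self, response):
    # opens an interactive shell on this specific response, so you can
    # compare what the spider got against what fetch() gives you
    inspect_response(response, self)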
P.s. I'll have to tell my wife it was for educational purposes..
I'm trying to crawl a website, and my spider (I don't know why) is crawling my links in such disorder!!
It crawls all the links I want, but it stores only the first one (rank and url_seller, as in the example below)... I'm new to this universe of crawling, Python and Scrapy, but all I want is to learn!! I'm posting my code here; could somebody help me?
# -*- coding: utf-8 -*-
import scrapy
import re
import numbers
from MarketplacePulse.items import MarketplacepulseItem
import urllib.parse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MarketplacePulseSpider(CrawlSpider):
    name = 'MP_A_UK'
    allowed_domains = ['marketplacepulse.com', 'amazon.co.uk']
    start_urls = ['https://www.marketplacepulse.com/amazon/uk']

    def parse(self, response):
        item = MarketplacepulseItem()
        rank = response.xpath('//div/table/tbody/tr/td[@class="number"]/text()').extract()
        print('\n', rank, '\n')
        url_1 = response.xpath('//div/table/tbody/tr/td/a/@href').extract()
        print('\n', url_1, '\n')
        for i in range(len(rank)-2):
            item['month_rank'] = ''.join(rank[i]).strip()
            item['year_rank'] = ''.join(rank[i+1]).strip()
            item['lifetime_rank'] = ''.join(rank[i+2]).strip()
            i += 3
        for i in range(len(url_1)):
            url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url_1[i])
            yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})

    def parse_2(self, response):
        item = response.meta['item']
        url_2 = response.xpath('//body/div/section/div/div/div/p/a[contains(text(), "Amazon.co.uk")]/@href').extract()
        item['url_seller'] = ''.join(url_2).strip()
        yield scrapy.Request(str(url_2), callback=self.parse_3, meta={'item': item})

    def parse_3(self, response):
        item = response.meta['item']
        business_name = response.xpath('//div[@class="a-row a-spacing-medium"]/div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[.="Business Name:"]/following-sibling::text()').extract()
        phone_number = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[.="Phone number:"]/following-sibling::text()').extract()
        address = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[span[contains(.,"Address:")]]/ul//li//text()').extract()
        item['business_name'] = ''.join(business_name).strip()
        item['phone_number'] = ''.join(phone_number).strip()
        item['address'] = '\n'.join(address).strip()
        yield item
I also post an example of what I want and of what I get... you'll see the problem, I hope!!
What I want :
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A7CL6GT0UVQKS&tab=&vasStoreID=>
{'address': '55740 Currant Rd\nMishawaka\nIndiana\n46545\nUS',
'business_name': 'Better World Books Marketplace Inc',
'lifetime_rank': '863',
'month_rank': '218',
'phone_number': '',
'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
'year_rank': '100'}
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=W5VG5JB9GHYUG&tab=&vasStoreID=>
{'address': 'ROOM 919, BLOCK 2 West, SEG TECHNOLOGY PARK\n'
'SHENZHEN\n'
'GUANGDONG\n'
'518000\n'
'CN\n'
'FU TIAN QU QIAO XIANG LU HAO FENG YUAN 7 DONG 7A\n'
'SHENZHEN\n'
'GUANGDONG\n'
'518000\n'
'CN',
'business_name': 'MUDDER TECHNOLOGY CO., LTD',
'lifetime_rank': '3',
'month_rank': '28',
'phone_number': '86 18565729081',
'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=W5VG5JB9GHYUG&tag=mk4343k-21',
'year_rank': '10'}
And what I get :
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A20T907OQC02JJ&tab=&vasStoreID=>
{'address': '55740 Currant Rd\nMishawaka\nIndiana\n46545\nUS',
'business_name': 'Better World Books Marketplace Inc',
'lifetime_rank': '863',
'month_rank': '218',
'phone_number': '',
'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
'year_rank': '100'}
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A1XG2K8M6HRQZ8&tab=&vasStoreID=>
{'address': 'ROOM 919, BLOCK 2 West, SEG TECHNOLOGY PARK\n'
'SHENZHEN\n'
'GUANGDONG\n'
'518000\n'
'CN\n'
'FU TIAN QU QIAO XIANG LU HAO FENG YUAN 7 DONG 7A\n'
'SHENZHEN\n'
'GUANGDONG\n'
'518000\n'
'CN',
'business_name': 'MUDDER TECHNOLOGY CO., LTD',
'lifetime_rank': '863',
'month_rank': '218',
'phone_number': '86 18565729081',
'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
'year_rank': '100'}
You can see that the url_seller values are exactly the same, and the ranks (by month, year, or lifetime) are too... but I want them to be different..... And the url_seller is not the same as the link I crawled, but it should be..... Any help please?
Ranking issues
I'll walk through it step by step:
You get a list of rank numbers:
rank = response.xpath('//div/table/tbody/tr/td[@class="number"]/text()').extract()
You get a list of URLs:
url_1 = response.xpath('//div/table/tbody/tr/td/a/@href').extract()
Here's where you go wrong:
for i in range(len(rank)-2):
    item['month_rank'] = ''.join(rank[i]).strip()
    item['year_rank'] = ''.join(rank[i+1]).strip()
    item['lifetime_rank'] = ''.join(rank[i+2]).strip()
    i += 3
Firstly, since you're using a for loop, your i variable gets rebound to the next item in the range at the beginning of each iteration, so you're stepping through the ranks one by one, not by threes. That i += 3 is doing nothing, sorry.
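A two-line demonstration of why that assignment is discarded:

for i in range(3):
    print(i)   # prints 0, 1, 2 regardless
    i += 3     # has no effect: the for statement reassigns i on the next iteration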
Anyway, the purpose of the loop appears to be to build the following structure:
{'month_rank': <rank>, 'year_rank': <rank>, 'lifetime_rank': <rank>}
So..., secondly, each time you run this loop, you overwrite the previous set of values without having done anything with them. Oops.
You then proceed to loop through your list of URLs, passing the last set of rankings your previous loop built, along with each url to your parse_2 function.
for i in range(len(url_1)):
    url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url_1[i])
    yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})
You end up with each call to parse_2 having the same set of ranking data.
To fix this, you should deal with each URL and its associated rankings in the same loop:
for i in range(len(url_1)):
    url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url_1[i])
    item['month_rank'] = ''.join(rank[i*3]).strip()
    item['year_rank'] = ''.join(rank[i*3+1]).strip()
    item['lifetime_rank'] = ''.join(rank[i*3+2]).strip()
    yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})
That should fix your rank issues.
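As a design note, the same grouping can be written without index arithmetic by slicing the rank list into month/year/lifetime columns and zipping them against the URLs; a sketch, which also creates a fresh item per row so the queued requests don't all share one mutable dict:

for url, month, year, lifetime in zip(url_1, rank[0::3], rank[1::3], rank[2::3]):
    item = MarketplacepulseItem()   # fresh item per row avoids the overwrite problem
    item['month_rank'] = month.strip()
    item['year_rank'] = year.strip()
    item['lifetime_rank'] = lifetime.strip()
    url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url)
    yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})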
url_seller issue
I'm not too sure about the url_seller issue, because it seems like it should use the same URL both for item['url_seller'] and for its call to parse_3, and it seems like it's using the right info to call parse_3, yet it keeps putting the same information in item['url_seller'] over and over again.
I'm kind of going out on a limb here, since if I'm understanding the situation properly, both methods should (in the particular case that I think this is) make equal strings, but the only difference I've noticed so far is that for one you're using ''.join(url_2).strip() and for the other you're using str(url_2).
Since the part where you're using str(url_2) seems to be working properly where it's being used, perhaps you should try using it in the other spot too:
item['url_seller'] = str(url_2)
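One thing worth checking, though: extract() always returns a list, so str(url_2) yields something like "['https://...']" rather than a bare URL. Using extract_first() for both the item field and the request avoids the ambiguity entirely; a sketch:

url_2 = response.xpath('//body/div/section/div/div/div/p/a[contains(text(), "Amazon.co.uk")]/@href').extract_first()
item['url_seller'] = url_2
yield scrapy.Request(url_2, callback=self.parse_3, meta={'item': item})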
I am trying to scrape lynda.com courses and store their info in a CSV file. This is my code:
# -*- coding: utf-8 -*-
import scrapy
import itertools

class LyndadevSpider(scrapy.Spider):
    name = 'lyndadev'
    allowed_domains = ['lynda.com']
    start_urls = ['https://www.lynda.com/Developer-training-tutorials']

    def parse(self, response):
        #print(response.url)
        titles = response.xpath('//li[@role="presentation"]//h3/text()').extract()
        descs = response.xpath('//li[@role="presentation"]//div[@class="meta-description hidden-xs dot-ellipsis dot-resize-update"]/text()').extract()
        links = response.xpath('//li[@role="presentation"]/div/div/div[@class="col-xs-8 col-sm-9 card-meta-data"]/a/@href').extract()
        for title, desc, link in itertools.izip(titles, descs, links):
            #print link
            categ = scrapy.Request(link, callback=self.parse2)
            yield {'desc': link, 'category': categ}

    def parse2(self, response):
        #getting categories by storing the navigation info
        item = response.xpath('//ol[@role="navigation"]').extract()
        return item
What I am trying to do here is grab the titles and descriptions of the list of tutorials, and then navigate to each URL and grab the categories in parse2.
However, I get results like this:
category,desc
<GET https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html>,https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html
<GET https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html>,https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html
<GET https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html>,https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html
How do I access the information that I want?
You need to yield a scrapy.Request in the parse method that parses the responses of start_urls (instead of yielding a dict). Also, I would rather loop over course items and extract the information for each course item separately.
I'm not sure what you mean exactly by categories. I suppose those are the tags you can see on the course details page at the bottom under Skills covered in this course. But I might be wrong.
Try this code:
# -*- coding: utf-8 -*-
import scrapy

class LyndaSpider(scrapy.Spider):
    name = "lynda"
    allowed_domains = ["lynda.com"]
    start_urls = ['https://www.lynda.com/Developer-training-tutorials']

    def parse(self, response):
        courses = response.css('ul#category-courses div.card-meta-data')
        for course in courses:
            item = {
                'title': course.css('h3::text').extract_first(),
                'desc': course.css('div.meta-description::text').extract_first(),
                'link': course.css('a::attr(href)').extract_first(),
            }
            request = scrapy.Request(item['link'], callback=self.parse_course)
            request.meta['item'] = item
            yield request

    def parse_course(self, response):
        item = response.meta['item']
        #item['categories'] = response.css('div.tags a em::text').extract()
        item['category'] = response.css('ol.breadcrumb li:last-child a span::text').extract_first()
        return item
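As a side note, Scrapy 1.7+ offers cb_kwargs as a more explicit alternative to stashing the item in meta; a sketch of the same hand-off under that API:

    def parse(self, response):
        for course in response.css('ul#category-courses div.card-meta-data'):
            item = {
                'title': course.css('h3::text').extract_first(),
                'desc': course.css('div.meta-description::text').extract_first(),
                'link': course.css('a::attr(href)').extract_first(),
            }
            # the callback receives item as a named argument
            yield scrapy.Request(item['link'], callback=self.parse_course, cb_kwargs={'item': item})

    def parse_course(self, response, item):
        item['category'] = response.css('ol.breadcrumb li:last-child a span::text').extract_first()
        return item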