I am trying to web-scrape multiple pages from a real estate website. I have been successful in scraping the first page of my URL, but I am unable to handle pagination. I have attempted to find the tag with class 'red' and follow its next sibling, which I believe should give me the next page's response, and so on repeatedly. I have also read examples where people wrote their code to parse multiple pages at the same time.
Is it possible to do parallel/concurrent parsing? I want to be able to parse all 90 pages as fast as possible, but I don't know how to implement it. Any and all help is greatly appreciated. Thank you.
PROGRESS UPDATE 1:
I figured out why my CSV output is UTF-8 and displays Cyrillic characters correctly in my PyCharm IDE but shows ?? placeholders when opened in Excel. I have been able to work around this by importing the CSV file through Excel's Data > From Text/CSV.
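An alternative to the manual import, assuming the ?? placeholders come from Excel not detecting UTF-8 on its own: export the feed as UTF-8 with a BOM, which Excel does recognize. A minimal sketch using Scrapy's FEED_EXPORT_ENCODING setting:
custom_settings = {
    "FEEDS": {f'{file_name}.csv': {'format': 'csv'}},
    # 'utf-8-sig' prepends a byte-order mark so Excel opens the file as UTF-8
    "FEED_EXPORT_ENCODING": "utf-8-sig",
}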
PROGRESS UPDATE 2: I understand I could implement a for loop in my start_requests function and loop over pages (1, 90) or even (1, 120), but that is not what I want, since I assume this would make my code parse page by page rather than concurrently.
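For reference, yielding requests from such a loop does not actually force page-by-page processing: Scrapy only schedules the requests, and its downloader fetches them in parallel, bounded by the CONCURRENT_REQUESTS / CONCURRENT_REQUESTS_PER_DOMAIN settings. A sketch of that idea, assuming the listing accepts a ?page=N query parameter (this would need to be verified against the site):
def start_requests(self):
    base_url = 'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/ulan-bator/'
    # every Request yielded here goes into the scheduler; the pages are then
    # downloaded concurrently rather than strictly one after another
    for page in range(1, 91):
        yield Request(f'{base_url}?page={page}', callback=self.parse)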
HTML Snippet:
<ul class="number-list">
<li>
1
</li>
<li>
2
</li>
<li>
3
</li>
<li><span class="page-number">...</span></li>
<li>
89
</li>
<li>
90
</li>
<div class="clear"></div>
</ul>
Pagination Snippet:
# handling pagination
next_page = response.xpath("//a[contains(@class,'red')]/parent::li/following-sibling::li/a/@href").extract_first()
if next_page:
yield response.follow(next_page, callback=self.parse)
Full Code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import unicodecsv as csv
from datetime import datetime
from scrapy.crawler import CrawlerProcess
dt_today = datetime.now().strftime('%Y%m%d')
file_name = dt_today+' HPI Data'
# Create Spider class
class UneguiApartments(scrapy.Spider):
name = "unegui_apts"
allowed_domains = ["www.unegui.mn"]
custom_settings = {"FEEDS": {f'{file_name}.csv': {'format': 'csv'}}
}
def start_requests(self):
urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/ulan-bator/']
for url in urls:
yield Request(url, self.parse)
def parse(self, response, **kwargs):
cards = response.xpath("//li[contains(@class,'announcement-container')]")
# parse details
for card in cards:
name = card.xpath(".//a[@itemprop='name']/@content").extract_first()
price = card.xpath(".//*[@itemprop='price']/@content").extract_first()
rooms = card.xpath(".//div[contains(@class,'announcement-block__breadcrumbs')]/text()").extract_first().split('»')[0].strip()
link = card.xpath(".//a[@itemprop='url']/@href").extract_first()
date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
date = date_block[0].strip()
city = date_block[1].strip()
item = {'name': name,
'date': date,
'rooms': rooms,
'price': price,
'city': city,
}
# follow absolute link to scrape deeper level
yield response.follow(link, callback=self.parse_item, meta={'item': item})
def parse_item(self, response):
# retrieve previously scraped item between callbacks
item = response.meta['item']
# parse additional details
list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
list_a = response.xpath(".//a[contains(@class, 'value-chars')]//text()").extract()
# get additional details from list of <span> tags, element by element
floor_type = list_span[0].strip()
num_balcony = list_span[1].strip()
garage = list_span[2].strip()
window_type = list_span[3].strip()
door_type = list_span[4].strip()
num_window = list_span[5].strip()
# get additional details from list of <a> tags, element by element
commission_year = list_a[0].strip()
num_floors = list_a[1].strip()
area_sqm = list_a[2].strip()
floor = list_a[3].strip()
leasing = list_a[4].strip()
district = list_a[5].strip()
address = list_a[6].strip()
# update item with newly parsed data
item.update({
'district': district,
'address': address,
'area_sqm': area_sqm,
'floor': floor,
'commission_year': commission_year,
'num_floors': num_floors,
'num_windows': num_window,
'num_balcony': num_balcony,
'floor_type': floor_type,
'window_type': window_type,
'door_type': door_type,
'garage': garage,
'leasing': leasing
})
yield item
# handling pagination
next_page = response.xpath("//a[contains(@class,'red')]/parent::li/following-sibling::li/a/@href").extract_first()
if next_page:
yield response.follow(next_page, callback=self.parse)
# main driver
if __name__ == "__main__":
process = CrawlerProcess()
process.crawl(UneguiApartments)
process.start()
If I understand you correctly, you need to move the 'next page' handling into the parse function. I also just take the 'next page' button's href value and follow it.
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import unicodecsv as csv
from datetime import datetime
from scrapy.crawler import CrawlerProcess
dt_today = datetime.now().strftime('%Y%m%d')
file_name = dt_today+' HPI Data'
# Create Spider class
class UneguiApartments(scrapy.Spider):
name = "unegui_apts"
allowed_domains = ["www.unegui.mn"]
custom_settings = {"FEEDS": {f'{file_name}.csv': {'format': 'csv'}}
}
def start_requests(self):
urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/ulan-bator/']
for url in urls:
yield Request(url, self.parse)
def parse(self, response, **kwargs):
cards = response.xpath("//li[contains(@class,'announcement-container')]")
# parse details
for card in cards:
name = card.xpath(".//a[@itemprop='name']/@content").extract_first()
price = card.xpath(".//*[@itemprop='price']/@content").extract_first()
rooms = card.xpath(".//div[contains(@class,'announcement-block__breadcrumbs')]/text()").extract_first().split('»')[0].strip()
link = card.xpath(".//a[@itemprop='url']/@href").extract_first()
date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
date = date_block[0].strip()
city = date_block[1].strip()
item = {'name': name,
'date': date,
'rooms': rooms,
'price': price,
'city': city,
}
# follow absolute link to scrape deeper level
yield response.follow(link, callback=self.parse_item, meta={'item': item})
# handling pagination
next_page = response.xpath('//a[contains(@class, "number-list-next js-page-filter number-list-line")]/@href').get()
if next_page:
yield response.follow(next_page, callback=self.parse)
def parse_item(self, response):
# retrieve previously scraped item between callbacks
item = response.meta['item']
# parse additional details
list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
list_a = response.xpath(".//a[contains(@class, 'value-chars')]//text()").extract()
# get additional details from list of <span> tags, element by element
floor_type = list_span[0].strip()
num_balcony = list_span[1].strip()
garage = list_span[2].strip()
window_type = list_span[3].strip()
door_type = list_span[4].strip()
num_window = list_span[5].strip()
# get additional details from list of <a> tags, element by element
commission_year = list_a[0].strip()
num_floors = list_a[1].strip()
area_sqm = list_a[2].strip()
floor = list_a[3].strip()
leasing = list_a[4].strip()
district = list_a[5].strip()
address = list_a[6].strip()
# update item with newly parsed data
item.update({
'district': district,
'address': address,
'area_sqm': area_sqm,
'floor': floor,
'commission_year': commission_year,
'num_floors': num_floors,
'num_windows': num_window,
'num_balcony': num_balcony,
'floor_type': floor_type,
'window_type': window_type,
'door_type': door_type,
'garage': garage,
'leasing': leasing
})
yield item
# main driver
if __name__ == "__main__":
process = CrawlerProcess()
process.crawl(UneguiApartments)
process.start()
This should work.
I wanted to scrape the feed of sitepoint.com; this is my code:
import scrapy
from urllib.parse import urljoin
class SitepointSpider(scrapy.Spider):
# TODO: Add url tags (like /javascript) to the spider based on class parameters
name = "sitepoint"
allowed_domains = ["sitepoint.com"]
start_urls = ["http://sitepoint.com/javascript/"]
def parse(self, response):
data = []
for article in response.css("article"):
title = article.css("a.t12xxw3g::text").get()
href = article.css("a.t12xxw3g::attr(href)").get()
img = article.css("img.f13hvvvv::attr(src)").get()
time = article.css("time::text").get()
url = urljoin("https://sitepoint.com", href)
text = scrapy.Request(url, callback=self.parse_article)
data.append(
{"title": title, "href": href, "img": img, "time": time, "text": text}
)
yield data
def parse_article(self, response):
text = response.xpath(
'//*[@id="main-content"]/article/div/div/div[1]/section/text()'
).extract()
yield text
And this is the response I get:
[{'title': 'How to Build an MVP with React and Firebase',
'href': '/react-firebase-build-mvp/',
'img': 'https://uploads.sitepoint.com/wp-content/uploads/2021/09/1632802723react-firebase-mvp-
app.jpg',
'time': 'September 28, 2021',
'text': <GET https://sitepoint.com/react-firebase-build-mvp/>}]
It just does not scrape the article text from the URLs. I followed everything said in this question but still could not make it work.
You have to visit the detail page from the listing to scrape the article.
In that case you have to yield the URL first, then yield the data in the last callback.
Also, //*[@id="main-content"]/article/div/div/div[1]/section/text() won't return any text, since there are lots of HTML elements under the section tag.
One solution is to scrape the whole HTML inside the section tag and clean it later to get your article text.
Here is the full working code:
import re
import scrapy
from urllib.parse import urljoin
class SitepointSpider(scrapy.Spider):
# TODO: Add url tags (like /javascript) to the spider based on class parameters
name = "sitepoint"
allowed_domains = ["sitepoint.com"]
start_urls = ["http://sitepoint.com/javascript/"]
def clean_text(self, raw_html):
"""
:param raw_html: this will take raw html code
:return: text without html tags
"""
cleaner = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
return re.sub(cleaner, '', raw_html)
def parse(self, response):
for article in response.css("article"):
title = article.css("a.t12xxw3g::text").get()
href = article.css("a.t12xxw3g::attr(href)").get()
img = article.css("img.f13hvvvv::attr(src)").get()
time = article.css("time::text").get()
url = urljoin("https://sitepoint.com", href)
yield scrapy.Request(url, callback=self.parse_article, meta={"title": title,
"href": href,
"img": img,
"time": time})
def parse_article(self, response):
title = response.request.meta["title"]
href = response.request.meta["href"]
img = response.request.meta["img"]
time = response.request.meta["time"]
all_data = {}
article_html = response.xpath('//*[@id="main-content"]/article/div/div/div[1]/section').get()
all_data["title"] = title
all_data["href"] = href
all_data["img"] = img
all_data["time"] = time
all_data["text"] = self.clean_text(article_html)
yield all_data
I can't seem to figure out how to construct this xpath selector. I have even tried using nextsibling::text but to no avail. I have also browsed stackoverflow questions for scraping listed values but could not implement it correctly. I keep getting blank results. Any and all help would be appreciated. Thank you.
The website is https://www.unegui.mn/adv/5737502_10-r-khoroolold-1-oroo/.
Expected Results:
Woods
2015
Current Results:
blank
Current XPath Scrapy code:
list_li = response.xpath(".//ul[contains(@class, 'chars-column')]/li/text()").extract()
list_li = response.xpath("./ul[contains(@class,'value-chars')]//text()").extract()
floor_type = list_li[0].strip()
commission_year = list_li[1].strip()
HTML Snippet:
<div class="announcement-characteristics clearfix">
<ul class="chars-column">
<li class="">
<span class="key-chars">Flooring:</span>
<span class="value-chars">Wood</span></li>
<li class="">
<span class="key-chars">Commission year:</span>
2015
</li>
</ul>
</div>
FURTHER CLARIFICATION:
I previously did two selectors (one for the span list, one for the <a> list), but the problem was that some pages on the website don't follow the same span list / <a> list order (i.e. on one page a table value would be in the span list, but on some other page it would be in the <a> list). That is why I have been trying to use only one selector and get all the values.
This results in shifted values: instead of the number of windows (an integer) being scraped, the address is scraped, because on some pages the table value is under the <a> list rather than the span list.
Previous 2 selectors:
list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
list_a = response.xpath(".//a[contains(@class,'value-chars')]//text()").extract()
Whole Code (if someone needs it to test it):
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from datetime import datetime
from scrapy.crawler import CrawlerProcess
from selenium import webdriver
dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' UB HPI Buying Data'
# create Spider class
class UneguiApartmentsSpider(scrapy.Spider):
name = "unegui_apts"
allowed_domains = ["www.unegui.mn"]
custom_settings = {
"FEEDS": {
f'{filename}.csv': {
'format': 'csv',
'overwrite': True}}
}
# function used for start url
def start_requests(self):
urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/ulan-bator/']
for url in urls:
yield Request(url, self.parse)
def parse(self, response, **kwargs):
cards = response.xpath("//li[contains(@class,'announcement-container')]")
# parse details
for card in cards:
name = card.xpath(".//a[@itemprop='name']/@content").extract_first().strip()
price = card.xpath(".//*[@itemprop='price']/@content").extract_first().strip()
rooms = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__breadcrumbs')]/span[2]/text())").extract_first().strip()
link = card.xpath(".//a[@itemprop='url']/@href").extract_first().strip()
date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
date = date_block[0].strip()
city = date_block[1].strip()
item = {'name': name,
'date': date,
'rooms': rooms,
'price': price,
'city': city,
}
# follow absolute link to scrape deeper level
yield response.follow(link, callback=self.parse_item, meta={'item': item})
# handling pagination
next_page = response.xpath("//a[contains(@class,'number-list-next js-page-filter number-list-line')]/@href").get()
if next_page:
yield response.follow(next_page, callback=self.parse)
print(f'Scraped {next_page}')
def parse_item(self, response):
# retrieve previously scraped item between callbacks
item = response.meta['item']
# parse additional details
list_li = response.xpath(".//*[contains(@class, 'value-chars')]/text()").extract()
# get additional details from list of <span> tags, element by element
floor_type = list_li[0].strip()
num_balcony = list_li[1].strip()
commission_year = list_li[2].strip()
garage = list_li[3].strip()
window_type = list_li[4].strip()
num_floors = list_li[5].strip()
door_type = list_li[6].strip()
area_sqm = list_li[7].strip()
floor = list_li[8].strip()
leasing = list_li[9].strip()
district = list_li[10].strip()
num_window = list_li[11].strip()
address = list_li[12].strip()
#list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
#list_a = response.xpath(".//a[contains(@class,'value-chars')]//text()").extract()
# get additional details from list of <span> tags, element by element
#floor_type = list_span[0].strip()
#num_balcony = list_span[1].strip()
#garage = list_span[2].strip()
#window_type = list_span[3].strip()
#door_type = list_span[4].strip()
#num_window = list_span[5].strip()
# get additional details from list of <a> tags, element by element
#commission_year = list_a[0].strip()
#num_floors = list_a[1].strip()
#area_sqm = list_a[2].strip()
#floor = list_a[3].strip()
#leasing = list_a[4].strip()
#district = list_a[5].strip()
#address = list_a[6].strip()
# update item with newly parsed data
item.update({
'district': district,
'address': address,
'area_sqm': area_sqm,
'floor': floor,
'commission_year': commission_year,
'num_floors': num_floors,
'num_windows': num_window,
'num_balcony': num_balcony,
'floor_type': floor_type,
'window_type': window_type,
'door_type': door_type,
'garage': garage,
'leasing': leasing
})
yield item
def __init__(self):
self.driver = webdriver.Firefox()
def parse_item2(self, response):
self.driver.get(response.url)
while True:
next = self.driver.find_element_by_xpath(".//span[contains(@class,'phone-author__title')]//text()")
try:
next.click()
# get the data and write it to scrapy items
except:
break
self.driver.close()
# main driver
if __name__ == "__main__":
process = CrawlerProcess()
process.crawl(UneguiApartmentsSpider)
process.start()
You need two selectors: one will parse the keys and another one will parse the values. This will result in two lists that can be zipped together to give you the results you are looking for.
CSS Selectors could be like:
Keys Selector --> .chars-column li .key-chars
Values Selector --> .chars-column li .value-chars
Once you extract both lists, you can zip them and consume them as key value.
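A minimal sketch of that idea inside parse_item, using the selectors suggested above (whether every key/value pair lines up one-to-one on every listing would need to be verified against the site):
def parse_item(self, response):
    item = response.meta['item']
    # keys and values sit in parallel positions inside each <li>
    keys = [k.strip() for k in response.css('.chars-column li .key-chars::text').getall()]
    values = [v.strip() for v in response.css('.chars-column li .value-chars::text').getall()]
    # zip into label -> value pairs, so it no longer matters whether a value
    # lives in a <span> or an <a> element
    item.update(dict(zip(keys, values)))
    yield item
Note that the resulting keys are the site's own labels, which you could then map to your English field names.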
I suppose this is because of invalid HTML (some span elements are not closed), so normal XPaths are not possible.
This did give me results:
".//*[contains(@class,'value-chars')]"
The * means any element, so it will select both
<span class="value-chars">Wood</span>
and
2015
Use this XPath to get Wood
//*[@class="chars-column"]//span[2]//text()
Use this XPath to get 2015
//*[@class="chars-column"]//a[text()="2015"]
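For example, inside parse_item these could be used like this (a sketch; the second expression hard-codes the year text, so a more general selector such as //a[contains(@class, 'value-chars')]/text() may be preferable in practice):
floor_type = response.xpath('//*[@class="chars-column"]//span[2]//text()').get()
commission_year = response.xpath('//*[@class="chars-column"]//a[text()="2015"]/text()').get()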
I am having problems going through multiple pages. Here is my Scrapy spider class, called quotes.
class quotes(scrapy.Spider):
name = 'quotes'
start_urls = ['http://books.toscrape.com/?']
def parse(self, response):
all_links = response.css('.nav-list ul li')
for links in all_links:
link = links.css('a::attr(href)').get()
yield response.follow(link, callback = self.books_detail)
def books_detail(self, response):
yas = {
'title':[],
'price':[],
'availability':[],
'category':[]
}
yas['category'].append(response.css('h1::text').extract())
all_divs = response.css('.col-lg-3')
for div in all_divs:
link = div.css('.product_pod a::attr(href)').get()
title = response.follow(link, callback = self.get_title)
yas['price'].append(div.css('.price_color::text').extract())
yas['availability'].append(div.css('.availability::text')[1].extract())
yield yas
def get_title(self,response):
print('testing')
title = response.css('h1::text').extract()
yield {"title":title}
So I use response.follow to go to the books_detail function, and in that function I again call response.follow to call get_title. I get the 'title' from get_title and the rest of the details from the main page.
I can scrape the information just fine from the books_detail function, and I can get the link of the title page just fine as well from this line:
link = div.css('.product_pod a::attr(href)').get()
But using response.follow I cannot get into the get_title function.
Any help would be appreciated. Thanks.
You should yield the request, not run it directly, and use meta= to send data to the next parser:
yield response.follow(link, callback=self.get_title, meta={'item': yas})
and in the next parser you can get it back:
yas = response.meta['item']
and then you can add new values and yield all the data:
yas["title"] = response.css('h1::text').extract()
yield yas
See another example in Scrapy yield items from multiple requests
Doc: Request and Response, Request.meta special keys
Minimal working code which you can put in one file and run as a normal script (python script.py) without creating a project.
There are other changes.
You shouldn't put all books into one list but yield every book separately. Scrapy will keep all results, and when you use the option to save to CSV it will save all of them.
For every book you should create a new dictionary. If you use the same dictionary many times, it will overwrite data and you may get many results with the same data.
import scrapy
class QuotesSpider(scrapy.Spider):
name = 'quotes'
start_urls = ['http://books.toscrape.com/']
def parse(self, response):
all_links = response.css('.nav-list ul li')
for links in all_links:
link = links.css('a::attr(href)').get()
yield response.follow(link, callback=self.books_detail)
def books_detail(self, response):
all_divs = response.css('.col-lg-3')
for div in all_divs:
# every book in separated dictionary and it has to be new dictionary - because it could overwrite old data
book = {
'category': response.css('h1::text').extract(),
'price': div.css('.price_color::text').extract()[0].strip(),
'availability': div.css('.availability::text')[1].extract().strip(),
}
link = div.css('.product_pod a::attr(href)').get()
yield response.follow(link, callback=self.get_title, meta={'item': book})
def get_title(self, response):
book = response.meta['item']
print('testing:', response.url)
book["title"] = response.css('h1::text').extract()[0].strip()
yield book
# --- run without project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
'USER_AGENT': 'Mozilla/5.0',
# save in file CSV, JSON or XML
'FEED_FORMAT': 'csv', # csv, json, xml
'FEED_URI': 'output.csv', #
})
c.crawl(QuotesSpider)
c.start()
I built my first Scrapy spider in several hours over the last two days, but I am stuck right now. The main purpose I wanted to achieve is to extract all the data so I can later filter it in the CSV. Now, the data that is really crucial for me (companies without! webpages) is dropped, because the XPath I provided only matches when an item has a homepage. I tried an if statement here, but it's not working.
Example website: https://www.achern.de/de/Wirtschaft/Unternehmen-A-Z/Unternehmen?view=publish&item=company&id=1345
I use the XPath selector: response.xpath("//div[@class='cCore_contactInformationBlockWithIcon cCore_wwwIcon']/a/@href").extract()
Example non-website: https://www.achern.de/de/Wirtschaft/Unternehmen-A-Z/Unternehmen?view=publish&item=company&id=1512
Spider Code:
# -*- coding: utf-8 -*-
import scrapy
class AchernSpider(scrapy.Spider):
name = 'achern'
allowed_domains = ['www.achern.de']
start_urls = ['https://www.achern.de/de/Wirtschaft/Unternehmen-A-Z/']
def parse(self, response):
for href in response.xpath("//ul[@class='cCore_list cCore_customList']/li[*][*]/a/@href"):
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback= self.scrape)
def scrape(self, response):
#Extracting the content using css selectors
print("Processing:"+response.url)
firma = response.css('div>#cMpu_publish_company>h2.cCore_headline::text').extract()
anschrift = response.xpath("//div[contains(@class,'cCore_addressBlock_address')]/text()").extract()
tel = response.xpath("//div[@class='cCore_contactInformationBlockWithIcon cCore_phoneIcon']/text()").extract()
mail = response.xpath(".//div[@class='cCore_contactInformationBlock']//*[contains(text(), '@')]/text()").extract()
web1 = response.xpath("//div[@class='cCore_contactInformationBlockWithIcon cCore_wwwIcon']/a/@href").extract()
if "http:" not in web1:
web = "na"
else:
web = web1
row_data=zip(firma,anschrift,tel,mail,web1) #web1 must be changed to web but then it only give out "n" for every link
#Give the extracted content row wise
for item in row_data:
#create a dictionary to store the scraped info
scraped_info = {
'Firma' : item[0],
'Anschrift' : item[1] +' 77855 Achern',
'Telefon' : item[2],
'Mail' : item[3],
'Web' : item[4],
}
#yield or give the scraped info to scrapy
yield scraped_info
So overall it should export the DROPPED items even if "web" is not there.
Hope someone can help, greetings S
Using
response.css(".cCore_wwwIcon > a::attr(href)").get()
gives you either None or the website address; then you can use or to provide a default:
website = response.css(".cCore_wwwIcon > a::attr(href)").get() or 'na'
Also, I refactored your scraper to use css selectors. Note that I've used .get() instead of .extract() to get a single item, not a list, which cleans up the code quite a bit.
import scrapy
from scrapy.crawler import CrawlerProcess
class AchernSpider(scrapy.Spider):
name = 'achern'
allowed_domains = ['www.achern.de']
start_urls = ['https://www.achern.de/de/Wirtschaft/Unternehmen-A-Z/']
def parse(self, response):
for url in response.css("[class*=cCore_listRow] > a::attr(href)").extract():
yield scrapy.Request(url, callback=self.scrape)
def scrape(self, response):
# Extracting the content using css selectors
firma = response.css('.cCore_headline::text').get()
anschrift = response.css('.cCore_addressBlock_address::text').get()
tel = response.css(".cCore_phoneIcon::text").get()
mail = response.css("[href^=mailto]::attr(href)").get().replace('mailto:', '')
website = response.css(".cCore_wwwIcon > a::attr(href)").get() or 'na'
scraped_info = {
'Firma': firma,
'Anschrift': anschrift + ' 77855 Achern',
'Telefon': tel,
'Mail': mail,
'Web': website,
}
yield scraped_info
if __name__ == "__main__":
p = CrawlerProcess()
p.crawl(AchernSpider)
p.start()
output:
with website:
{'Firma': 'Wölfinger Fahrschule GmbH', 'Anschrift': 'Güterhallenstraße 8 77855 Achern', 'Telefon': '07841 6738132', 'Mail': 'info@woelfinger-fahrschule.de', 'Web': 'http://www.woelfinger-fahrschule.de'}
without website:
{'Firma': 'Zappenduster-RC Steffen Liepe', 'Anschrift': 'Am Kirchweg 16 77855 Achern', 'Telefon': '07841 6844700', 'Mail': 'Zappenduster-Rc@hotmail.de', 'Web': 'na'}
I am trying to scrape lynda.com courses and store their info in a CSV file. This is my code:
# -*- coding: utf-8 -*-
import scrapy
import itertools
class LyndadevSpider(scrapy.Spider):
name = 'lyndadev'
allowed_domains = ['lynda.com']
start_urls = ['https://www.lynda.com/Developer-training-tutorials']
def parse(self, response):
#print(response.url)
titles = response.xpath('//li[@role="presentation"]//h3/text()').extract()
descs = response.xpath('//li[@role="presentation"]//div[@class="meta-description hidden-xs dot-ellipsis dot-resize-update"]/text()').extract()
links = response.xpath('//li[@role="presentation"]/div/div/div[@class="col-xs-8 col-sm-9 card-meta-data"]/a/@href').extract()
for title, desc, link in itertools.izip(titles, descs, links):
#print link
categ = scrapy.Request(link, callback=self.parse2)
yield {'desc': link, 'category': categ}
def parse2(self, response):
#getting categories by storing the navigation info
item = response.xpath('//ol[@role="navigation"]').extract()
return item
What I am trying to do here is grab the titles and descriptions from the list of tutorials, then navigate to each URL and grab the categories in parse2.
However, I get results like this:
category,desc
<GET https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html>,https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html
<GET https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html>,https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html
<GET https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html>,https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html
How do I access the information that I want?
You need to yield a scrapy.Request in the parse method that parses the responses of start_urls (instead of yielding a dict). Also, I would rather loop over course items and extract the information for each course item separately.
I'm not sure what you mean exactly by categories. I suppose those are the tags you can see on the course details page at the bottom under Skills covered in this course. But I might be wrong.
Try this code:
# -*- coding: utf-8 -*-
import scrapy
class LyndaSpider(scrapy.Spider):
name = "lynda"
allowed_domains = ["lynda.com"]
start_urls = ['https://www.lynda.com/Developer-training-tutorials']
def parse(self, response):
courses = response.css('ul#category-courses div.card-meta-data')
for course in courses:
item = {
'title': course.css('h3::text').extract_first(),
'desc': course.css('div.meta-description::text').extract_first(),
'link': course.css('a::attr(href)').extract_first(),
}
request = scrapy.Request(item['link'], callback=self.parse_course)
request.meta['item'] = item
yield request
def parse_course(self, response):
item = response.meta['item']
#item['categories'] = response.css('div.tags a em::text').extract()
item['category'] = response.css('ol.breadcrumb li:last-child a span::text').extract_first()
return item
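If you want to run this without creating a project, the same CrawlerProcess pattern used in the earlier examples should work here as well (a sketch; the output filename is just illustrative):
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'FEEDS': {'lynda_courses.csv': {'format': 'csv'}}})
process.crawl(LyndaSpider)
process.start()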