Getting text in python scrapy


I got this code from a website:
import scrapy
class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']
    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract(),
            }
I use this code to crawl data. In the results, name is the list returned by the extract() method.
How can I get name as "10805: Around the World", or only as "Around the World"?

To get "10805: Around the World", change your yield to:
yield {
    'name': " ".join(brickset.css(NAME_SELECTOR).extract()),
}
To get "Around the World", change your yield to:
yield {
    'name': brickset.css(NAME_SELECTOR).extract()[-1],
}
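
To see what the two expressions do, here is a minimal sketch in plain Python. The list parts is an assumed stand-in for what extract() returns for one set; the real fragments on the page may differ (for example in whitespace), and the helper names are made up for illustration.

```python
# Hypothetical stand-in for brickset.css(NAME_SELECTOR).extract();
# the actual text fragments returned by the page may differ.
parts = ["10805:", "Around the World"]


def full_name(fragments):
    """Join every text fragment into one string, separated by spaces."""
    return " ".join(fragments)


def short_name(fragments):
    """Keep only the last text fragment (the set name without its number)."""
    return fragments[-1]


print(full_name(parts))   # 10805: Around the World
print(short_name(parts))  # Around the World
```

Note that [-1] raises IndexError when the selector matches nothing, so it may be worth checking that the list is not empty first.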

Related

Scraping Tripadvisor attractions using scrapy and python

I am trying to scrape TripAdvisor's attractions, but I cannot get the name and address of each attraction. I suspect I wrote product.css(...) wrong (or maybe the data is in JSON?).
Can anyone tell me how to correct the code to get the name and address of each attraction?
My current code:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.tripadvisor.com/Attractions-g187427-Activities-oa90-Spain'
    ]
    def parse(self, response):
        for link in response.css('.EsZYd a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_categories)
    def parse_categories(self, response):
        products = response.css('div.eeqnt')
        for product in products:
            yield {
                'name' : product.css('h1.WlYyy cPsXC GeSzT::text').get().strip(),
                'address' : product.css('span.WlYyy cacGK Wb::text').get().strip(),
            }
Updated code (exporting info for each attraction on each listing page):
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
'https://www.tripadvisor.com/Attractions-g274862-Activities-a_allAttractions.true-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa30-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa60-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa90-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa120-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa150-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa180-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa210-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa240-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa270-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa300-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa330-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa360-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa390-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa420-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa450-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa480-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa510-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa540-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa570-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa600-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa630-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa660-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa690-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa720-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa750-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa780-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa810-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa840-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa870-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa900-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa930-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa960-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa990-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1020-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1050-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1080-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1110-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1140-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1170-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1200-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1230-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1260-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1290-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1320-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1350-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1380-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1410-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1440-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1470-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1500-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1530-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1560-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1590-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1620-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1650-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1680-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1710-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1740-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1770-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1800-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1830-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1860-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1890-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1920-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1950-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa1980-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2010-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2040-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2070-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2100-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2130-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2160-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2190-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2220-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2250-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2280-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2310-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2340-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2370-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2400-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2430-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2460-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2490-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2520-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2550-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2580-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2610-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2640-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2670-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2700-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2730-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2760-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2790-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2820-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2850-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2880-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2910-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2940-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa2970-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa3000-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa3030-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa3060-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa3090-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa3120-Slovenia.html',
'https://www.tripadvisor.com/Attractions-g274862-Activities-oa3150-Slovenia.html'
    ]
    def parse(self, response):
        for link in response.css('.EsZYd a::attr(href)').getall():
            yield response.follow(link, callback=self.parse_categories)
    def parse_categories(self, response):
        yield {
            'name': response.css('h1.WlYyy.cPsXC.GeSzT::text').get(),
            'reviews': response.xpath('(//*[@class="cfIVb"])[1]//text()').getall(),
            'address': response.xpath('(//*[@class="dGWve"])//text()').getall(),
            'url': response.url,
        }
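
The long start_urls list above follows a regular pattern: after the first page, the oa offset grows by 30 up to 3150. Here is a minimal sketch of generating the same list, assuming that pattern holds for every page. The helper name build_start_urls and the two constants are hypothetical, not part of the original spider.

```python
FIRST_PAGE = (
    "https://www.tripadvisor.com/"
    "Attractions-g274862-Activities-a_allAttractions.true-Slovenia.html"
)
PAGE_TEMPLATE = (
    "https://www.tripadvisor.com/"
    "Attractions-g274862-Activities-oa{offset}-Slovenia.html"
)


def build_start_urls(last_offset=3150, step=30):
    """Return the first listing page followed by every paginated page."""
    offsets = range(step, last_offset + 1, step)
    return [FIRST_PAGE] + [PAGE_TEMPLATE.format(offset=o) for o in offsets]


start_urls = build_start_urls()
print(len(start_urls))  # 106
```

Inside the spider, the class attribute could then be written as start_urls = build_start_urls().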

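This is not really a Python problem but a CSS selector problem.
Multiple classes on the same element must be chained with dots, not separated by spaces: h1.WlYyy.cPsXC.GeSzT. With spaces, the selector looks for descendant elements named cPsXC and GeSzT instead.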
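
As a small illustration, the class attribute copied from the page can be turned into a compound selector mechanically. This is a sketch in plain Python; the helper name compound_selector is hypothetical and not part of Scrapy.

```python
def compound_selector(tag, class_attr):
    """Build a CSS selector matching `tag` elements that carry every class
    listed in `class_attr` (the raw, space-separated class attribute)."""
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


selector = compound_selector("h1", "WlYyy cPsXC GeSzT")
print(selector)  # h1.WlYyy.cPsXC.GeSzT
```

The result can then be passed to response.css(selector + '::text').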
The best suggestion is to use Chrome with DevTools. It lets you get the path to a specific element as a CSS selector or XPath: right-click the element in the DOM tree and choose an option from the Copy menu.
Avoid using classes (especially ones without semantic meaning) as anchor points. They may change from page to page, or over time.
It is better to anchor on semantically meaningful nodes. In your case,
the XPath for the title would look like this: //main//header//div[@data-automation="main_h1"]//h1.
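
The idea of anchoring on an attribute rather than on generated class names can be tried without running a crawl. The sketch below uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) instead of Scrapy, and the markup in SAMPLE_PAGE is invented for illustration; it is not TripAdvisor's real page.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed fragment; the real page structure may differ.
SAMPLE_PAGE = """
<main>
  <header>
    <div data-automation="main_h1" class="abc123">
      <h1 class="WlYyy cPsXC GeSzT">Park Guell</h1>
    </div>
  </header>
</main>
"""


def find_title(markup):
    """Return the text of the h1 under the div marked data-automation="main_h1"."""
    root = ET.fromstring(markup)
    node = root.find(".//header//div[@data-automation='main_h1']//h1")
    if node is None or node.text is None:
        return None
    return node.text.strip()


title = find_title(SAMPLE_PAGE)
print(title)  # Park Guell
```

The lookup keeps working if the class names on the div or h1 change, because it never refers to them.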

You don't need a for loop on the detail pages: each one describes a single attraction, so yield one item per response.
import scrapy
from scrapy.crawler import CrawlerProcess
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.tripadvisor.com/Attractions-g187427-Activities-oa90-Spain'
    ]
    def parse(self, response):
        for link in response.css('.EsZYd a::attr(href)').getall():
            #print(link)
            yield response.follow(link, callback=self.parse_categories)
    def parse_categories(self, response):
        yield {
            'name': response.css('h1.WlYyy.cPsXC.GeSzT::text').get(),
            'address': ''.join(response.xpath('(//*[@class="hxQKk"])[1]//text()').getall()[:-1]),
            'url': response.url
        }
if __name__ == "__main__":
    # CrawlerProcess takes optional settings; the spider class goes to crawl()
    process = CrawlerProcess()
    process.crawl(QuotesSpider)
    process.start()

Scrapy problems with crawling specific TAG

I am having a problem with my scrapy program. I want to crawl information from the following website:
https://parts.cat.com/AjaxCATPartLookupResultsView?catalogId=10051&langId=-1&requestType=1&storeId=21801&serialNumber=KSN00190&keyword=&link=
I want to get the "Part No." information inside the "span id=resPartNum" tag. I have already tried:
- NAME_SELECTOR = './/*[@id="resPartNum"]/text()'
- NAME_SELECTOR = './/span[@class="resPartNum"]/text()'
- NAME_SELECTOR = './/tr/td/span[@class="resPartNum"]/a/text()'
Here is my full code:
import scrapy
class PartSpider(scrapy.Spider):
    name = 'part_spider'
    start_urls = ['https://parts.cat.com/AjaxCATPartLookupResultsView?catalogId=10051&langId=-1&requestType=1&storeId=21801&serialNumber=KSN00190&keyword=&link=']
    def parse(self, response):
        SET_SELECTOR = '.set'
        for part in response.css(SET_SELECTOR):
            NAME_SELECTOR = './/*[@id="resPartNum"]/text()'
            yield {
                'name': part.css(NAME_SELECTOR).extract_first(),
            }
I am not very advanced in scrapy and would appreciate ANY HELP!!

Use the CSS selector table.partlookup_table to select the table, then collect the part names and part numbers inside the loop. Note that extract() returns a list.
import scrapy
from scrapy.crawler import CrawlerProcess
class PartSpider(scrapy.Spider):
    name = 'part_spider'
    start_urls = ['https://parts.cat.com/AjaxCATPartLookupResultsView?catalogId=10051&langId=-1&requestType=1&storeId=21801&serialNumber=KSN00190&keyword=&link=']
    def parse(self, response):
        SET_SELECTOR = 'table.partlookup_table'
        for part in response.css(SET_SELECTOR):
            #NAME_SELECTOR = './/*[@id="resPartNum"]/text()'
            yield {
                'name': part.css('span.resPartName a::text').extract(),
                'partnumber': part.css('span.resPartNum a::text').extract()
            }
process = CrawlerProcess()
process.crawl(PartSpider)
process.start()
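
Because extract() returns one list per field, the item above holds two parallel lists. If one record per part is preferred, the lists can be paired up. Here is a sketch in plain Python; the values below are placeholders, not data from the site, and pair_parts is a hypothetical helper.

```python
# Placeholder lists standing in for the two extract() results.
names = ["Example part A ", "Example part B", "Example part C"]
part_numbers = ["000-0001", " 000-0002", "000-0003"]


def pair_parts(names, part_numbers):
    """Combine parallel lists into one dict per part.

    zip() stops at the shorter list, so unmatched trailing values are dropped.
    """
    return [
        {"name": name.strip(), "partnumber": number.strip()}
        for name, number in zip(names, part_numbers)
    ]


parts = pair_parts(names, part_numbers)
print(parts[0])  # {'name': 'Example part A', 'partnumber': '000-0001'}
```

In the spider, each dict could be yielded separately instead of yielding the two lists in one item.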

Scrapy putting out empty JSON / CSV files

I'm very new to scrapy and python and could really do with some help. I've got this code to work in command line. I can see it pulling out all the right information as it goes through the different pages.
My problem is that when I try to save the output of the script to a file it comes out empty. I have looked at lots of other questions on here but can't find anything that helps.
Here is the code:
import scrapy
from urlparse import urljoin
class Aberdeenlocations1Spider(scrapy.Spider):
    name = "aberdeenlocations2"
    start_urls = [
        'http://brighthouse.co.uk/store-finder/all-stores',
    ]
    def parse(self, response):
        products = response.xpath('//ul/li/a/@href').extract()
        for p in products:
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_product)
    def parse_product(self, response):
        for div in response.css('div'):
            yield {
                title: (response.css('title::text').extract()),
                address: (response.css('[itemprop=streetAddress]::text').extract()),
                locality: (response.css('[itemprop=addressLocality]::text').extract()),
                region: (response.css('[itemprop=addressRegion]::text').extract()),
                postcode: (response.css('[itemprop=postalCode]::text').extract()),
                telephone: (response.css('[itemprop=telephone]::text').extract()),
                script: (response.xpath('//div/script').extract()),
                gmaplink: (response.xpath('//div/div/div/p/a/@href').extract_first())
            }
I am then running this command on the above script:
scrapy crawl aberdeenlocations2 -o data.json
What am I doing wrong?

I think there are just some Python errors in your yield: the dictionary keys must be quoted strings. With the version below I get some data in the output:
import scrapy
from urllib.parse import urljoin  # Python 3; on Python 2 this was: from urlparse import urljoin
class Aberdeenlocations1Spider(scrapy.Spider):
    name = "aberdeenlocations2"
    start_urls = [
        'http://brighthouse.co.uk/store-finder/all-stores',
    ]
    def parse(self, response):
        products = response.xpath('//ul/li/a/@href').extract()
        for p in products:
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_product)
    def parse_product(self, response):
        # not sure why this loop is there
        for div in response.css('div'):
            yield {
                'title': response.css('title::text').extract(),
                'address': response.css('[itemprop=streetAddress]::text').extract(),
                'locality': response.css('[itemprop=addressLocality]::text').extract(),
                'region': response.css('[itemprop=addressRegion]::text').extract(),
                'postcode': response.css('[itemprop=postalCode]::text').extract(),
                'telephone': response.css('[itemprop=telephone]::text').extract(),
                'script': response.xpath('//div/script').extract(),
                'gmaplink': response.xpath('//div/div/div/p/a/@href').extract_first()
            }
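
The difference between the two versions can be reproduced outside Scrapy. In the sketch below (plain Python, with hypothetical function names), an unquoted key is read as a variable name, and since no such variable exists the dictionary is never built. Inside a spider callback this would most likely show up as an exception in the log and no items in the output file.

```python
def build_item_unquoted(page_title):
    # `title` is read as a variable name; none is defined, so Python
    # raises NameError before the dictionary is created.
    return {title: page_title}


def build_item_quoted(page_title):
    # 'title' is a string literal, so this is a valid dictionary key.
    return {'title': page_title}


try:
    build_item_unquoted("Store finder")
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

print(error_name)                         # NameError
print(build_item_quoted("Store finder"))  # {'title': 'Store finder'}
```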

Why Is the Web Crawler I made with Python not working?

I am following a tutorial on webscraping by Justin Duke of Digital Ocean. Here is the link to the tutorial
https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
When I run my code, the web crawler displays the following error:
'BrickSetSpider.parse callback is not defined'.
I'm not sure what this means.
Here is the code I used.
import scrapy
class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']
def parse(self, response):
    SET_SELECTOR = '.set'
    for brickset in response.css(SET_SELECTOR):
        NAME_SELECTOR = 'h1 ::text'
        yield {
            'name': brickset.css(NAME_SELECTOR).extract_first(),
        }
I am also pretty new to Python. So, I would appreciate it if your answer to my question was written in such a way that a noobie would be able to understand it.

Indent parse so that it is defined inside the class, like this:
import scrapy
class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']
    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 ::text'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }
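
Why indentation matters here can be shown without Scrapy. In this sketch, BaseSpider is a made-up stand-in for scrapy.Spider, whose default parse raises an error of this kind; BrokenSpider and FixedSpider are hypothetical names. A function written at the left margin belongs to the module, not to the class above it, so the class falls back to the base method.

```python
class BaseSpider:
    """Stand-in for scrapy.Spider: the default parse only raises an error."""

    def parse(self, response):
        raise NotImplementedError(
            f"{type(self).__name__}.parse callback is not defined"
        )


class BrokenSpider(BaseSpider):
    name = "broken"


def parse(self, response):
    # Not indented under the class, so this is a plain module-level
    # function and BrokenSpider never sees it.
    return ["item"]


class FixedSpider(BaseSpider):
    name = "fixed"

    def parse(self, response):
        # Indented under the class, so it overrides BaseSpider.parse.
        return ["item"]


try:
    BrokenSpider().parse(None)
    message = None
except NotImplementedError as exc:
    message = str(exc)

print(message)                    # BrokenSpider.parse callback is not defined
print(FixedSpider().parse(None))  # ['item']
```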

How can I get proper response back from scrapy?

I am trying to scrape some search results from this company register, but when I try to scrape the company name, my results don't come back properly. It looks as if the company name is split into two HTML items around the search keyword.
Is there a way to join these together? This is my spider:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = 'gov2'
    start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']
    def parse(self, response):
        for i in response.css('ul.results-list'):
            yield {
                'company_name': i.css('li.type-company h3 a::text').extract(),
                'address': i.css('li.type-company p::text').extract(),
            }
My results are missing some parts.
I hope one of you can see what's going on. Thank you!

As I see it, you want all of the text inside the a and p tags, and there are further tags nested inside them. Writing a ::text with a space before ::text also selects the text of those nested tags.
Try this, which removes the unnecessary whitespace with a regex:
import scrapy
import re
class QuotesSpider(scrapy.Spider):
    name = 'gov2'
    start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']
    def parse(self, response):
        for i in response.css('ul.results-list'):
            yield {
                'company_name': re.sub(r'\s+', ' ', ''.join(i.css('li.type-company h3 a ::text').extract())),
                'address': re.sub(r'\s+', ' ', ''.join(i.css('li.type-company p ::text').extract())),
            }

Using the same regex, I just modified the code for better output: looping over .type-company yields one item per company.
import re
import scrapy
class QuotesSpider(scrapy.Spider):
    name = 'gov2'
    start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']
    def parse(self, response):
        for i in response.css('.type-company'):
            yield {
                'company_name': re.sub(r'\s+', ' ', ''.join(i.css('h3 a ::text').extract())),
                'address': re.sub(r'\s+', ' ', ''.join(i.css('p ::text').extract())),
            }
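
The join-then-collapse step can be checked on its own. This is a sketch in plain Python: the fragments list is invented to resemble a company name split around a highlighted keyword, clean_text is a hypothetical helper, and the final strip() is an addition that the answers above do not use.

```python
import re


def clean_text(fragments):
    """Join text fragments and collapse every run of whitespace to one space."""
    joined = "".join(fragments)
    return re.sub(r"\s+", " ", joined).strip()


# Invented fragments, shaped like a name whose matched keyword sits in
# its own nested tag and is surrounded by layout whitespace.
fragments = ["\n      ", "A", " & B TRADING LIMITED\n    "]
print(clean_text(fragments))  # A & B TRADING LIMITED
```

The raw string r"\s+" avoids the invalid escape sequence warning that '\s+' triggers on recent Python versions.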
