How to Import URLs From Spider to Spider? - python

I am building a Scrapy spider, WuzzufLinks, that scrapes all the links to specific jobs on a job website, starting from this link:
https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt
After scraping the links, I would like to send them to another spider WuzzufSpider, which scrapes data from inside each link. The start_urls would be the first link in the scraped list, and the next_page would be the following link, and so on.
I have thought of importing WuzzufLinks into WuzzufSpider and then accessing its data:
import scrapy
from ..items import WuzzufscraperItem

class WuzzuflinksSpider(scrapy.Spider):
    name = 'WuzzufLinks'
    page_number = 1
    start_urls = ['https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt']

    def parse(self, response):
        items = WuzzufscraperItem()
        jobURL = response.css('h2[class=css-m604qf] a::attr(href)').extract()
        items['jobURL'] = jobURL
        yield items

        next_page = 'https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt&start=' + str(WuzzuflinksSpider.page_number)
        if WuzzuflinksSpider.page_number <= 100:
            yield response.follow(next_page, callback=self.parse)
            WuzzuflinksSpider.page_number += 1
# WuzzufSpider
import scrapy
from ..items import WuzzufscraperItem
from spiders.WuzzufLinks import WuzzuflinksSpider

class WuzzufspiderSpider(scrapy.Spider):
    name = 'WuzzufSpider'
    parseClass = WuzzuflinksSpider().parse()
    start_urls = []

    def parse(self, response):
        items = WuzzufscraperItem()
        # CSS selectors
        title = response.css('').extract()
        company = response.css('').extract()
        location = response.css('').extract()
        country = response.css('').extract()
        date = response.css('').extract()
        careerLevel = response.css('').extract()
        experienceNeeded = response.css('').extract()
        jobType = response.css('').extract()
        jobFunction = response.css('').extract()
        salary = response.css('').extract()
        description = response.css('').extract()
        requirements = response.css('').extract()
        skills = response.css('').extract()
        industry = response.css('').extract()
        jobURL = response.css('').extract()
        # next_page and if statement here
Regardless of whether I have written the outlined parts correctly, I have realized that accessing jobURL would return an empty value, since it is only a temporary container. I have thought of saving the scraped links to another file and then importing them into WuzzufSpider, but I don't know whether the import is valid and whether they will still be a list:
# links.xml
<?xml version="1.0" encoding="utf-8"?>
<items>
<item><jobURL><value>/jobs/p/P5A2NWkkWfv6-Sales-Operations-Specialist-Amreyah-Cement---InterCement-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3</value><value>/jobs/p/pEmZ96R097N3-Senior-Laravel-Developer-Learnovia-Cairo-Egypt?o=2&l=sp&t=sj&a=search-v3</value><value>/jobs/p/IgHkjP37ymQp-French-Talent-Acquisition-Specialist-Guide-Academy-Giza-Egypt?o=3&l=sp&t=sj&a=search-v3</value><value>/jobs/p/zOLTqLqegEZe-Export-Sales-Representative-packtec-Cairo-Egypt?o=4&l=sp&t=sj&a=search-v3</value><value>/jobs/p/U3Q1TDpxzsJJ-Finishing-Site-Engineer--Assiut-Assiut-Egypt?o=5&l=sp&t=sj&a=search-v3</value><value>/jobs/p/7aQ4QxtYV8N6-Senior-QC-Automation-Engineer-FlairsTech-Cairo-Egypt?o=6&l=sp&t=sj&a=search-v3</value><value>/jobs/p/qHWyGU7ClMG6-Technical-Office-Engineer-Cairo-Egypt?o=7&l=sp&t=sj&a=search-v3</value><value>/jobs/p/ptN7qnERUvPT-B2B-Sales-Representative-Smart-Zone-Cairo-Egypt?o=8&l=sp&t=sj&a=search-v3</value><value>/jobs/p/VUVc0ZAyUNYU-Digital-Marketing-supervisor-National-Trade-Distribution-Cairo-Egypt?o=9&l=sp&t=sj&a=search-v3</value><value>/jobs/p/WzJhyeVpT5jb-Receptionist-Value-Cairo-Egypt?o=10&l=sp&t=sj&a=search-v3</value><value>/jobs/p/PAdZOdzWjqbr-Insurance-Specialist-Bancassuranc---Sohag-Allianz-Sohag-Egypt?o=11&l=sp&t=sj&a=search-v3</value><value>/jobs/p/nJD6YbE4QjNX-Senior-Research-And-Development-Specialist-Cairo-Egypt?o=12&l=sp&t=sj&a=search-v3</value><value>/jobs/p/DVvMG4BFWEeI-Technical-Sales-Engineer-Masria-Group-Cairo-Egypt?o=13&l=sp&t=sj&a=search-v3</value><value>/jobs/p/3RtCveEFjveW-Technical-Office-Engineer-Masria-Group-Cairo-Egypt?o=14&l=sp&t=sj&a=search-v3</value><value>/jobs/p/kswGaw4kXTe8-Administrator-Kreston-Cairo-Egypt?o=15&l=sp&t=sj&a=search-v3</value></jobURL></item>
</items>
# WuzzufSpider
import scrapy
from ..items import WuzzufscraperItem
from links import jobURL

class WuzzufspiderSpider(scrapy.Spider):
    name = 'WuzzufSpider'
    start_urls = [jobURL[0]]

    def parse(self, response):
        items = WuzzufscraperItem()
        # CSS selectors
        title = response.css('').extract()
        company = response.css('').extract()
        location = response.css('').extract()
        country = response.css('').extract()
        date = response.css('').extract()
        careerLevel = response.css('').extract()
        experienceNeeded = response.css('').extract()
        jobType = response.css('').extract()
        jobFunction = response.css('').extract()
        salary = response.css('').extract()
        description = response.css('').extract()
        requirements = response.css('').extract()
        skills = response.css('').extract()
        industry = response.css('').extract()
        jobURL = response.css('').extract()
        # next_page and if statement here
Is there a way to make the second method work, or is a completely different approach needed?
I have checked the threads Scrapy: Pass data between 2 spiders and Pass scraped URL's from one spider to another. I understand that I can do all of the work in one spider, and that there is a way to save to a database or temporary file in order to send data to another spider. However, I am not yet very experienced and don't understand how to implement such changes, so marking this question as a duplicate won't help me. Thank you for your help.

First of all, you can keep crawling the URLs from the same spider, and honestly I don't see a reason for you not to.
Anyway, if you really want to have two spiders, where the output of the first is the input of the second, you can do something like this:
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher
from scrapy import signals
from twisted.internet import reactor, defer


# grab all the product urls
class ExampleSpider(scrapy.Spider):
    name = "exampleSpider"
    start_urls = ['https://scrapingclub.com/exercise/list_basic']

    def parse(self, response):
        all_urls = response.xpath('//div[@class="card"]/a/@href').getall()
        for url in all_urls:
            yield {'url': 'https://scrapingclub.com' + url}


# get the product's details
class ExampleSpider2(scrapy.Spider):
    name = "exampleSpider2"

    def parse(self, response):
        title = response.xpath('//h3/text()').get()
        price = response.xpath('//div[@class="card-body"]//h4//text()').get()
        yield {
            'title': title,
            'price': price
        }


if __name__ == "__main__":
    # this will hold the items yielded by the first spider
    output = []

    def get_output(item):
        output.append(item)

    configure_logging()
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    runner = CrawlerRunner(settings)

    # run spiders sequentially
    # (https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process)
    @defer.inlineCallbacks
    def crawl():
        dispatcher.connect(get_output, signal=signals.item_scraped)
        yield runner.crawl('exampleSpider')
        urls = [url['url'] for url in output]  # create a list of the urls from the first spider
        # crawl the second spider with the urls from the first spider
        yield runner.crawl('exampleSpider2', start_urls=urls)
        reactor.stop()

    crawl()
    reactor.run()
Run this and see that you first get the results from the first spider, and that those results are passed as the "start_urls" for the second spider.
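This works because keyword arguments passed to runner.crawl() are forwarded to the spider's constructor, and the base scrapy.Spider.__init__ copies them onto the spider instance. Purely as an illustration (Scrapy already does this for you), an explicit constructor would look roughly like this:
import scrapy

# Illustrative only: an explicit constructor equivalent to what Scrapy does
# automatically when you call runner.crawl('exampleSpider2', start_urls=urls).
class ExampleSpider2Explicit(scrapy.Spider):
    name = "exampleSpider2Explicit"

    def __init__(self, start_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # keyword arguments from runner.crawl() arrive here and become
        # the spider's start_urls attribute
        self.start_urls = start_urls or []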
EDIT:
Doing it all in the same spider. See how we loop over all the URLs and scrape each of them in the parse_item method. I filled in some of the values you want to scrape as an example, so just fill in the rest and you're done.
import scrapy
# from ..items import WuzzufscraperItem


class WuzzufscraperItem(scrapy.Item):
    title = scrapy.Field()
    company = scrapy.Field()
    location = scrapy.Field()
    country = scrapy.Field()
    jobURL = scrapy.Field()
    date = scrapy.Field()
    careerLevel = scrapy.Field()
    experienceNeeded = scrapy.Field()
    jobType = scrapy.Field()
    jobFunction = scrapy.Field()
    salary = scrapy.Field()
    description = scrapy.Field()
    requirements = scrapy.Field()
    skills = scrapy.Field()
    industry = scrapy.Field()


class WuzzuflinksSpider(scrapy.Spider):
    name = 'WuzzufLinks'
    page_number = 1
    start_urls = ['https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt']

    def parse(self, response):
        all_urls = response.css('h2[class=css-m604qf] a::attr(href)').getall()
        if all_urls:
            for url in all_urls:
                yield response.follow(url=url, callback=self.parse_item)

        next_page = 'https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt&start=' + str(WuzzuflinksSpider.page_number)
        if WuzzuflinksSpider.page_number <= 100:
            yield response.follow(next_page)
            WuzzuflinksSpider.page_number += 1

    def parse_item(self, response):
        items = WuzzufscraperItem()
        # CSS selectors
        # Some values as an example:
        items['title'] = response.xpath('(//h1)[last()]/text()').get(default='')
        items['company'] = response.xpath('(//a[@class="css-p7pghv"])[last()]/text()').get(default='')
        items['location'] = response.xpath('(//strong[@class="css-9geu3q"])[last()]/text()').get(default='')
        items['country'] = response.xpath('//meta[@property="og:country_name"]/@content').get(default='')
        items['jobURL'] = response.url
        # items['date'] = response.css('').get(default='')
        # items['careerLevel'] = response.css('').get(default='')
        # items['experienceNeeded'] = response.css('').get(default='')
        # items['jobType'] = response.css('').get(default='')
        # items['jobFunction'] = response.css('').get(default='')
        # items['salary'] = response.css('').get(default='')
        # items['description'] = response.css('').get(default='')
        # items['requirements'] = response.css('').get(default='')
        # items['skills'] = response.css('').get(default='')
        # items['industry'] = response.css('').get(default='')
        yield items

Related

Generate JSON dictionary from recursive scrapy functions

I am running the Scrapy spider below on Airbnb for academic purposes. I scrape all listings first
(such as: https://www.airbnb.com/s/Berlin--Germany/homes?tab_id=all_tab&query=Berlin%2C%20Germany&place_id=ChIJAVkDPzdOqEcRcDteW0YgIQQ&checkin=2020-05-01&adults=1&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=search_query&checkout=2020-05-02)
to get their ids and then go to the listing's page
(such as: https://www.airbnb.de/rooms/20839690?location=Berlin&check_in=2020-05-01&check_out=2020-05-02&adults=1)
and get the geo-data from the details JSON. Ideally, I would like to have a final JSON nested like:
{{'ID': ID1, 'Title': Title1, 'Latitude': Lat1},{'ID': ID2, 'Title': Title2, 'Latitude': Lat2}}
Because of the recursive structure, I have the full lists of titles, prices, etc. already in the first go, while lng and lat are only one element per loop run, so the output looks like:
{{Price1, Price2, Price3..., id1, id2...lng1, lat1}, {Price1, Price2, Price3..., id1, id2..., lng2, lat2}}
Any idea how I can restructure the code to get the above structure?
Cheers
marcello
Spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from airbnb.items import AirbnbItem
import json
import pprint

all_ids = []
detail = {}


class AirbnbSpider(scrapy.Spider):
    name = 'airbnb_spider'
    allowed_domains = ['airbnb.com', 'airbnb.de']
    start_urls = ['https://www.airbnb.de/s/Berlin/homes?checkin=2020-05-01&checkout=2020-05-02&adults=1']

    def parse(self, response):
        item = AirbnbItem()
        for listing in response.xpath('//div[@class = "_fhph4u"]'):
            detail["title"] = listing.xpath('//a[@class = "_i24ijs"]/@aria-label').extract()
            detail["price"] = listing.xpath('//span[@class = "_1p7iugi"]/text()').extract()
            detail["rating"] = listing.xpath('//span[@class = "_3zgr580"]/text()').get()
            detail["id"] = listing.xpath('//a[@class = "_i24ijs"]/@target').extract()
            # item["link"] = listing.xpath('//a[@class = "_i24ijs"]/@href').extract()

        x_id = [i.split('_')[1] for i in detail['id']]
        detail['id'] = x_id
        for i in x_id:
            link = 'https://www.airbnb.de/api/v2/pdp_listing_details/' + i + '?_format=for_rooms_show&_p3_impression_id=p3_1587291065_1e%2FBlC2IefkrfTQe&adults=1&check_in=2020-05-01&check_out=2020-05-02&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&'
            yield scrapy.Request(url=link, callback=self.parse_detail)

    def parse_detail(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        detail["lat"] = jsonresponse["pdp_listing_detail"]["lat"]
        detail["lng"] = jsonresponse["pdp_listing_detail"]["lng"]
        return detail
Items
import scrapy


class AirbnbItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    price = scrapy.Field()
    id = scrapy.Field()
    rating = scrapy.Field()
    lat = scrapy.Field()
    lng = scrapy.Field()
    pass
You can pass information to the parse_detail method and yield from there:
def parse(self, response):
    item = AirbnbItem()
    for listing in response.xpath('//div[@class = "_fhph4u"]'):
        detail = {}  # use a fresh dict per listing so each request carries its own data
        detail["title"] = listing.xpath('//a[@class = "_i24ijs"]/@aria-label').get()
        detail["price"] = listing.xpath('//span[@class = "_1p7iugi"]/text()').get()
        detail["rating"] = listing.xpath('//span[@class = "_3zgr580"]/text()').get()
        detail["id"] = listing.xpath('//a[@class = "_i24ijs"]/@target').get()
        # item["link"] = listing.xpath('//a[@class = "_i24ijs"]/@href').get()
        detail['id'] = detail['id'].split('_')[1]
        link = 'https://www.airbnb.de/api/v2/pdp_listing_details/' + detail['id'] + '?_format=for_rooms_show&_p3_impression_id=p3_1587291065_1e%2FBlC2IefkrfTQe&adults=1&check_in=2020-05-01&check_out=2020-05-02&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&'
        yield scrapy.Request(url=link,
                             meta={'item': detail},  # pass information to the next method
                             callback=self.parse_detail)

def parse_detail(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    detail = response.meta['item']
    detail["lat"] = jsonresponse["pdp_listing_detail"]["lat"]
    detail["lng"] = jsonresponse["pdp_listing_detail"]["lng"]
    yield detail
BTW, the Item class is unused here (the code yields a plain dict), so there is no need for it.

Scrapy How to scrape items from multiple pages?

I am trying to scrape data from # pages. I have already built a scraper which can scrape data from a single # page, but it suddenly finishes after scraping the first page.
The whole file with the parse function and scrape function - Scraper.py
# -*- coding: utf-8 -*-
import scrapy
import csv
import os
from scrapy.selector import Selector
from scrapy import Request


class Proddduct(scrapy.Item):
    price = scrapy.Field()
    description = scrapy.Field()
    link = scrapy.Field()
    content = scrapy.Field()


class LapadaScraperSpider(scrapy.Spider):
    name = 'lapada_scraper2'
    allowed_domains = ['http://www.lapada.org']
    start_urls = ['https://lapada.org/art-and-antiques/?search=antique']

    def parse(self, response):
        next_page_url = response.xpath("//ul/li[@class='next']//a/@href").get()
        for item in self.scrape(response):
            yield item
        if next_page_url:
            print("Found url: {}".format(next_page_url))
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def scrape(self, response):
        parser = scrapy.Selector(response)
        products = parser.xpath("//div[@class='content']")
        for product in products:
            item = Proddduct()
            XPATH_PRODUCT_DESCRIPTION = ".//strong/text()"
            XPATH_PRODUCT_PRICE = ".//div[@class='price']/text()"
            XPATH_PRODUCT_LINK = ".//a/@href"
            raw_product_description = product.xpath(XPATH_PRODUCT_DESCRIPTION).extract()
            raw_product_price = product.xpath(XPATH_PRODUCT_PRICE).extract()
            raw_product_link = product.xpath(XPATH_PRODUCT_LINK).extract_first()
            item['description'] = raw_product_description
            item['price'] = raw_product_price
            item['link'] = raw_product_link
            yield item

    def get_information(self, response):
        item = response.meta['item']
        item['phonenumber'] = "12345"
        yield item
How can I scrape all items on all pages?
Thanks
Change allowed_domains = ['http://www.lapada.org'] to allowed_domains = ['lapada.org']
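The usual reason this matters is that allowed_domains expects bare domain names; an entry with a scheme like http:// makes the offsite middleware drop the next-page requests, which is why the spider stops after the first page. The corrected spider header would look like this (rest of the spider unchanged):
class LapadaScraperSpider(scrapy.Spider):
    name = 'lapada_scraper2'
    # bare domain only -- no scheme; subdomains such as www are still allowed
    allowed_domains = ['lapada.org']
    start_urls = ['https://lapada.org/art-and-antiques/?search=antique']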

Scrapy Spider following urls, but won't export the data

I am trying to grab details from a real estate listing page. I can grab all the data; I just can't seem to export it.
Perhaps it's a problem with the way I use the yield keyword. The code works for the most part:
Visits page 1, example.com/kittens
Goes to page 2, example.com/puppers. Here 10 apartments are listed in blocks. I can get data from each block, but I need additional info from inside the hyperlink.
Visits the hyperlink, say, example.com/puppers/apartment1. It grabs some info from here as well, but I can't seem to return this data to include it in my HousingItem() class.
import scrapy
from urllib.parse import urljoin


class HousingItem(scrapy.Item):
    street = scrapy.Field()
    postal = scrapy.Field()
    city = scrapy.Field()
    url = scrapy.Field()
    buildY = scrapy.Field()
    on_m = scrapy.Field()
    off_m = scrapy.Field()


class FAppSpider(scrapy.Spider):
    name = 'f_app'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/kittens']

    def parse(self, response):
        yield scrapy.Request(url="https://www.example.com/puppers",
                             callback=self.parse_puppers)

    def parse_inside_pupper(self, response):
        item = HousingItem()
        item['buildY'] = response.xpath('').extract_first().strip()
        item['on_m'] = response.xpath('').extract_first().strip()
        item['off_m'] = response.xpath('').extract_first().strip()

    def parse_puppers(self, response):
        base_url = 'https://www.example.com/'
        for block in response.css('div.search-result-main'):
            item = HousingItem()
            item['street'] = block.css(''),
            item['postcode'] = block.css(''),
            item['city'] = block.css('')
            item['url'] = urljoin(base_url, block.css('div.search-result-header > a::attr(href)')[0].extract())
            # Problem area from here..
            yield response.follow(url=item['url'], callback=self.parse_inside_pupper)
            # yield scrapy.request(url=item['url'], callback=self.parse_inside_pupper)?
            yield item
FEED_EXPORT_FIELDS is adjusted in my SETTINGS.py. The 4 items from parse_puppers() get exported correctly; the parse_inside_pupper() data is correct in the console, but won't export.
I use scrapy crawl f_app -o raw_data.csv to run my spider. Thanks in advance, appreciate all the help.
P.S. I'm fairly new to Python and practising, I bet you noticed.
You need to send your current item to parse_inside_pupper using the meta param (as written, parse_inside_pupper never yields anything, which is why its data is never exported):
def parse_puppers(self, response):
    base_url = 'https://www.example.com/'
    for block in response.css('div.search-result-main'):
        item = HousingItem()
        item['street'] = block.css(''),
        item['postcode'] = block.css(''),
        item['city'] = block.css('')
        item['url'] = urljoin(base_url, block.css('div.search-result-header > a::attr(href)')[0].extract())
        yield response.follow(url=item['url'], callback=self.parse_inside_pupper, meta={"item": item})
After that you can use it inside parse_inside_pupper (and yield the item from there):
def parse_inside_pupper(self, response):
    item = response.meta["item"]
    item['buildY'] = response.xpath('').extract_first().strip()
    item['on_m'] = response.xpath('').extract_first().strip()
    item['off_m'] = response.xpath('').extract_first().strip()
    yield item
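As a side note, newer Scrapy versions (1.7+) also support cb_kwargs, which hands the item to the callback as a regular keyword argument instead of going through meta. A minimal sketch, keeping the question's placeholder selectors:
def parse_puppers(self, response):
    for block in response.css('div.search-result-main'):
        item = HousingItem()
        item['url'] = response.urljoin(block.css('div.search-result-header > a::attr(href)').get())
        # cb_kwargs entries become keyword arguments of the callback
        yield response.follow(url=item['url'],
                              callback=self.parse_inside_pupper,
                              cb_kwargs={"item": item})

def parse_inside_pupper(self, response, item):
    item['buildY'] = response.xpath('').extract_first()  # placeholder selector, as in the question
    yield item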

Scrape information from Scraped URL

I am new to Scrapy and am currently learning how to scrape information from a list of scraped URLs. I have been able to scrape information from a URL by going through the tutorial on the Scrapy website. However, I am facing a problem scraping information from a list of URLs scraped from a URL, even after googling for a solution online.
The scraper that I have written below is able to scrape from the first URL. However, it is unsuccessful in scraping from the list of scraped URLs. The problem starts at def parse_following_urls(self, response):, where I am unable to scrape from the list of scraped URLs.
Can anyone help to solve this? Thanks in advance.
import scrapy
from scrapy.http import Request


class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()


class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            yield item
        urll = response.xpath('//table[@class]/tbody/tr/td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
        urls = ["http://marketdata.set.or.th/mkt/" + i for i in urll]
        for url in urls:
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True)
            yield request
            request.meta['item'] = item

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item
I have rewritten the code after trying the suggestions given and looking at the output. Below is the edited code. However, I got another error that states Request url must be str or unicode, got list. How do I convert the URL from a list to a string?
I thought the URL would be a string, as it is built in a for loop. I have added this as a comment in the code below. Is there any way to solve this?
import scrapy
from scrapy.http import Request


class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()


class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            url = ["http://marketdata.set.or.th/mkt/"] + sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True)  # Request url must be str or unicode, got list: How to solve this?
            request.meta['item'] = item
            yield item
            yield request

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item
I see what you are trying to do here; it's called chaining requests.
What this means is that you keep yielding Requests while carrying your filled Item along in the Request's meta attribute.
For your case, all you need to do is, instead of yielding the Item, yield a Request with the item in it. Change your parse to:
def parse(self, response):
    for sel in response.xpath('//table[@class]/tbody/tr'):
        item = SET()
        item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
        item['open'] = sel.xpath('td[3]/text()').extract()
        item['hi'] = sel.xpath('td[4]/text()').extract()
        item['lo'] = sel.xpath('td[5]/text()').extract()
        item['last'] = sel.xpath('td[6]/text()').extract()
        item['bid'] = sel.xpath('td[9]/text()').extract()
        item['ask'] = sel.xpath('td[10]/text()').extract()
        item['vol'] = sel.xpath('td[11]/text()').extract()
    urll = response.xpath('//table[@class]/tbody/tr/td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
    urls = ["http://marketdata.set.or.th/mkt/" + i for i in urll]
    for url in urls:
        yield scrapy.Request(url,
                             callback=self.parse_following_urls,
                             meta={'item': item})
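One caveat with the snippet above (not part of the original answer): because the detail-page requests are yielded after the row loop finishes, they all carry the last row's item. If each link should carry its own row's data, yielding the request inside the row loop keeps them paired, roughly like this:
def parse(self, response):
    for sel in response.xpath('//table[@class]/tbody/tr'):
        item = SET()
        item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
        # ... fill the remaining columns exactly as above ...
        href = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract_first()
        if href:
            # each request now carries the item built from its own row
            yield scrapy.Request("http://marketdata.set.or.th/mkt/" + href,
                                 callback=self.parse_following_urls,
                                 meta={'item': item},
                                 dont_filter=True)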
I tried changing the 5th line from the end,
item = response.meta['item']
to
item = SET()
and then it works!
Actually I didn't really understand your "meta" way, since I have never used it to pass an item.

How to use scrapy to scrape google play reviews of applications?

I wrote this spider to scrape reviews of apps from Google Play. I am partially successful in this: I am able to extract only the name, date, and review.
My questions:
How to get all the reviews, as I am getting only 41.
How to get the rating from the <div>?
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    rating = scrapy.Field()
    data = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()


class criticspider(CrawlSpider):
    name = "gaana"
    allowed_domains = ["play.google.com"]
    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]

    # rules = (
    #     Rule(
    #         SgmlLinkExtractor(allow=('search=jabong&page=1/+',)),
    #         callback="parse_start_url",
    #         follow=True),
    # )

    def parse(self, response):
        sites = response.xpath('//div[@class="single-review"]')
        items = []
        for site in sites:
            item = CompItem()
            item['data'] = site.xpath('.//div[@class="review-body"]/text()').extract()
            item['name'] = site.xpath('.//div/div/span[@class="author-name"]/a/text()').extract()[0]
            item['date'] = site.xpath('.//span[@class="review-date"]/text()').extract()[0]
            item['rating'] = site.xpath('div[@class="review-info-star-rating"]/aria-label/text()').extract()
            items.append(item)
        return items
you have
item['rating'] = site.xpath('div[@class="review-info-star-rating"]/aria-label/text()').extract()
should it not be something like:
item['rating'] = site.xpath('.//div[@class="review-info-star-rating"]/aria-label/text()').extract()
?? dunno if it will work, but try :)
You can try this one out:
item['rating'] = site.xpath('.//div[@class="tiny-star star-rating-non-editable-container"]/@aria-label').extract()
