I am trying to scrape the PlayStation web store: the title and game link from the main page, and the price for each game from its detail page. However, when using a callback to parse_page2, every returned item contains the title and item['link'] value of the most recently scraped item (The Last of Us Remastered).
My code below:
class PsStoreSpider(scrapy.Spider):
    name = 'psstore'
    start_urls = ['https://store.playstation.com/en-ie/pages/browse']

    def parse(self, response):
        item = PlaystationItem()
        products = response.css('a.psw-link')
        for product in products:
            item['main_url'] = response.url
            item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
            item['link'] = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
            link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
            request = Request(link, callback=self.parse_page2)
            request.meta['item'] = item
            yield request

    def parse_page2(self, response):
        item = response.meta['item']
        item['price'] = response.css('span.psw-t-title-m::text').get()
        item['other_url'] = response.url
        yield item
And part of the output:
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/229261>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/229261',
'price': 'Free',
'title': 'The Last of Us™ Remastered'}
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/232847>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/232847',
'price': '€59.99',
'title': 'The Last of Us™ Remastered'}
2022-05-09 19:54:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://store.playstation.com/en-ie/concept/224802> (referer: https://store.playstation.com/en-ie/pages/browse)
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/224802>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/224802',
'price': '€29.99',
'title': 'The Last of Us™ Remastered'}
As you can see, the price is correctly returned, but the title and link are taken from the last scraped object. What am I missing here?
Thanks
The thing is that you create your item at the beginning of your parse method and then update it over and over. That also means that you always pass the same item to parse_page2.
If you were to create your item in the for-loop you would get a new one in every iteration and should get the expected result.
Like this:
def parse(self, response):
    products = response.css('a.psw-link')
    for product in products:
        item = PlaystationItem()
        item['main_url'] = response.url
        item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
        item['link'] = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
        link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
        request = Request(link, callback=self.parse_page2)
        request.meta['item'] = item
        yield request
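A side note beyond the original answer: on recent Scrapy versions (1.7+) you can also pass the item to the callback through cb_kwargs instead of meta, which makes the hand-off explicit. A minimal sketch under that assumption, reusing the question's PlaystationItem:

def parse(self, response):
    for product in response.css('a.psw-link'):
        item = PlaystationItem()
        item['main_url'] = response.url
        item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
        link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
        item['link'] = link
        # cb_kwargs entries arrive as keyword arguments of the callback
        yield Request(link, callback=self.parse_page2, cb_kwargs={'item': item})

def parse_page2(self, response, item):
    item['price'] = response.css('span.psw-t-title-m::text').get()
    item['other_url'] = response.url
    yield item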
2021-05-07 10:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/robots.txt> (referer: None)
2021-05-07 10:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa/> (referer: None)
2021-05-07 10:07:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa?s=120> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa/)
2021-05-07 10:07:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa?s=240> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa?s=120)
This is the output I get. It seems like the spider just moves on to the next page of results, which is done by selecting the next button and issuing the request at the end of parse.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, Request
from craig.items import CraigItem
from scrapy.selector import Selector

class PhonesSpider(scrapy.Spider):
    name = 'phones'
    allowed_domains = ['tampa.craigslist.org']
    start_urls = ['https://tampa.craigslist.org/d/cell-phones/search/moa/']

    def parse(self, response):
        phones = response.xpath('//p[@class="result-info"]')
        for phone in phones:
            relative_url = phone.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)
            title = phone.xpath('a/text()').extract_first()
            price = phone.xpath('//*[@id="sortable-results"]/ul/li[3]/a/span').extract_first()
            yield Request(absolute_url, callback=self.parse_item, meta={'URL': absolute_url, 'Title': title, 'price': price})
        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = "https://tampa.craigslist.org" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)

    def parse_item(self, response):
        item = CraigItem()
        item["cl_id"] = response.meta.get('Title')
        item["price"] = response.meta.get
        absolute_url = response.meta.get('URL')
        yield {'URL': absolute_url, 'Title': title, 'price': price}
It seems like in my code the for phone in phones loop doesn't run, which results in parse_item never running while the spider continues requesting the next URL. I am following some tutorials and reading the documentation, but I'm still having trouble grasping what I am doing wrong. I have hobby experience coding Arduinos from when I was young, but no professional coding experience; this is my first foray into a project like this, though I have an OK grasp of the basics of loops, functions, callbacks, etc.
Any help is greatly appreciated.
UPDATE
current output
2021-05-07 15:29:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/robots.txt> (referer: None)
2021-05-07 15:29:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa/> (referer: None)
2021-05-07 15:29:33 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-05-07 15:29:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa/)
2021-05-07 15:29:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html>
{'cl_id': 'postid_7309734640',
'price': '$35',
'title': 'Cut that high cable bill, switch to SPC TV and save. 1400 hd '
'channels',
'url': 'https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html'}
2021-05-07 15:29:36 [scrapy.core.engine] INFO: Closing spider (finished)
CURRENT CODE
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, Request
from craig.items import CraigItem
from scrapy.selector import Selector

class PhonesSpider(scrapy.Spider):
    name = 'phones'
    allowed_domains = ['tampa.craigslist.org']
    start_urls = ['https://tampa.craigslist.org/d/cell-phones/search/moa/']
    base_url = 'https://tampa.craigslist.org'

    def parse(self, response):
        phones = response.xpath('//div[@class="result-info"]')
        for phone in phones:
            x = response.meta.get('x')
            n = -1
            url = response.xpath('//a[@class="result-title hdrlnk"]/@href').getall()
            relative_url = phone.xpath('//a[@class="result-title hdrlnk"]/@href').get()
            absolute_url = response.urljoin(relative_url)
            title = phone.xpath('//a[@class="result-title hdrlnk"]/text()').getall()
            price = phone.xpath('//span[@class="result-price"]/text()').getall()
            cl_id = phone.xpath('//a[@class="result-title hdrlnk"]/@id').getall()
            yield Request(absolute_url, callback=self.parse_item, meta={'absolute_url': absolute_url, 'url': url, 'title': title, 'price': price, 'cl_id': cl_id, 'n': n})

    def parse_item(self, response):
        n = response.meta.get('n')
        x = n + 1
        item = CraigItem()
        item["title"] = response.meta.get('title')[x]
        item["cl_id"] = response.meta.get('cl_id')[x]
        item["price"] = response.meta.get('price')[x]
        item["url"] = response.meta.get('url')[x]
        yield item
        absolute_next_url = response.meta.get('url')[x]
        absolute_url = response.meta.get('absolute_url')
        yield Request(absolute_next_url, callback=self.parse, meta={'x': x})
I am now able to retrieve the desired content for a posting: URL, price, title and craigslist ID. But now my spider automatically closes after pulling just one result, and I am having trouble understanding how the variables (x) and (n) are carried between the two functions. Logically, after pulling one listing's data in the format
cl_id
Price
title
url
I would like to proceed back to the initial parse function and move to the next item in the list of URLs retrieved by
response.xpath('//a[@class="result-title hdrlnk"]/@href').getall()
which, when run in the Scrapy shell, successfully pulls all the URLs.
How do I implement this logic? Start with [0] in the list, perform parse, perform parse_item, and output the item; then update a variable (n, which starts at 0 and increases by 1 after each item), call n in parse_item with its updated value, and use it (for example, item["title"] = response.meta.get('title')[x]) to pick the matching entry from each list; then run parse_item again, outputting one item at a time, until all the values in the URL list have been output with their related price, cl_id and title.
I know the code is messy as hell and I don't fully grasp the basics yet, but I'm committed to getting this to work and learning it the hard way rather than starting from the ground up with Python.
Class result-info is used within the div block, so you should write:
phones = response.xpath('//div[@class="result-info"]')
That being said, I didn't check/fix your spider further (it seems there are only parsing errors, not functional ones).
As a suggestion for the future, you can use Scrapy shell for quickly debugging the issues:
scrapy shell "your-url-here"
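Going a step beyond that answer (a sketch, not tested against the live site; the class names are taken from the question's own selectors): the index bookkeeping with n and x becomes unnecessary if each result is scraped with selectors relative to its own node — note the leading dot in .//:

def parse(self, response):
    for phone in response.xpath('//div[@class="result-info"]'):
        # './/' keeps each lookup inside this result node
        relative_url = phone.xpath('.//a[@class="result-title hdrlnk"]/@href').get()
        meta = {
            'title': phone.xpath('.//a[@class="result-title hdrlnk"]/text()').get(),
            'price': phone.xpath('.//span[@class="result-price"]/text()').get(),
            'cl_id': phone.xpath('.//a[@class="result-title hdrlnk"]/@id').get(),
        }
        yield Request(response.urljoin(relative_url), callback=self.parse_item, meta=meta)
    # pagination is followed separately; no need to re-enter parse from parse_item
    next_page = response.xpath('//a[@class="button next"]/@href').get()
    if next_page:
        yield Request(response.urljoin(next_page), callback=self.parse)

def parse_item(self, response):
    item = CraigItem()
    item['title'] = response.meta['title']
    item['price'] = response.meta['price']
    item['cl_id'] = response.meta['cl_id']
    item['url'] = response.url
    yield item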
I'm kind of a newb with Scrapy. My spider is not working properly when I try to scrape data from a forum. When I run my spider, it gives me only the printed URLs and stops afterwards, so I think the problem is in how the two functions parse and parse_data connect, but I may be wrong. Here is my code:
import scrapy, time

class ForumSpiderSpider(scrapy.Spider):
    name = 'forum_spider'
    allowed_domains = ['visforvoltage.org/latest_tech/']
    start_urls = ['http://visforvoltage.org/latest_tech//']

    def parse(self, response):
        for href in response.css(r"tbody a[href*='/forum/']::attr(href)").extract():
            url = response.urljoin(href)
            print(url)
            req = scrapy.Request(url, callback=self.parse_data)
            time.sleep(10)
            yield req

    def parse_data(self, response):
        for url in response.css('html').extract():
            data = {}
            data['name'] = response.css(r"div[class='author-pane-line author-name'] span[class='username']::text").extract()
            data['date'] = response.css(r"div[class='forum-posted-on']:contains('-') ::text").extract()
            data['title'] = response.css(r"div[class='section'] h1[class='title']::text").extract()
            data['body'] = response.css(r"div[class='field-items'] p::text").extract()
            yield data
        next_page = response.css(r"li[class='pager-next'] a[href*='page=']::attr(href)").extract()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse)
Here is the output:
2020-07-23 23:09:58 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'visforvoltage.org': <GET https://visforvoltage.org/forum/14521-aquired-a123-m1-cells-need-charger-and-bms>
https://visforvoltage.org/forum/14448-battery-charger-problems
https://visforvoltage.org/forum/14191-vectrix-trickle-charger
https://visforvoltage.org/forum/14460-what-epoxy-would-you-recommend-loose-magnet-repair
https://visforvoltage.org/forum/14429-importance-correct-grounding-and-well-built-plugs
https://visforvoltage.org/forum/14457-147v-charger-24v-lead-acid-charger-and-dying-vectrix-cells
https://visforvoltage.org/forum/6723-lithium-safety-e-bike
https://visforvoltage.org/forum/11488-how-does-24v-4-wire-reversible-motor-work
https://visforvoltage.org/forum/14444-new-sevcon-gen-4-80v-sale
https://visforvoltage.org/forum/14443-new-sevcon-gen-4-80v-sale
https://visforvoltage.org/forum/12495-3500w-hub-motor-question-about-real-power-and-breaker
https://visforvoltage.org/forum/14402-vectrix-vx-1-battery-pack-problem
https://visforvoltage.org/forum/14068-vectrix-trickle-charger
https://visforvoltage.org/forum/2931-drill-motors
https://visforvoltage.org/forum/14384-help-repairing-gio-hub-motor-freewheel-sprocket
https://visforvoltage.org/forum/14381-zev-charger
https://visforvoltage.org/forum/8726-performance-unite-my1020-1000w-motor
https://visforvoltage.org/forum/7012-controler-mod-veloteq
https://visforvoltage.org/forum/14331-scooter-chargers-general-nfpanec
https://visforvoltage.org/forum/14320-charging-nissan-leaf-cells-lifepo4-charger
https://visforvoltage.org/forum/3763-newber-needs-help-new-gift-kollmorgan-hub-motor
https://visforvoltage.org/forum/14096-european-bldc-controller-seller
https://visforvoltage.org/forum/14242-lithium-bms-vs-manual-battery-balancing
https://visforvoltage.org/forum/14236-mosfet-wiring-ignition-key
https://visforvoltage.org/forum/2007-ok-dumb-question-time%3A-about-golf-cart-controllers
https://visforvoltage.org/forum/10524-my-mf70-recommended-powerpoles-arrived-today
https://visforvoltage.org/forum/9460-how-determine-battery-capacity
https://visforvoltage.org/forum/7705-tricking-0-5-v-hall-effect-throttle
https://visforvoltage.org/forum/13446-overcharged-lead-acid-battery-what-do
https://visforvoltage.org/forum/14157-reliable-high-performance-battery-enoeco-bt-p380
https://visforvoltage.org/forum/2702-hands-test-48-volt-20-ah-lifepo4-pack-ping-battery
https://visforvoltage.org/forum/14034-simple-and-cheap-ev-can-bus-adaptor
https://visforvoltage.org/forum/13933-zivan-ng-3-charger-specs-and-use
https://visforvoltage.org/forum/13099-controllers
https://visforvoltage.org/forum/13866-electric-motor-werks-demos-25-kilowatt-diy-chademo-leaf
https://visforvoltage.org/forum/13796-motor-theory-ac-vs-bldc
https://visforvoltage.org/forum/6184-bypass-bms-lifepo4-good-idea-or-not
https://visforvoltage.org/forum/13763-positive-feedback-kelly-controller
https://visforvoltage.org/forum/13764-any-users-smart-battery-drop-replacement-zapino-and-others
https://visforvoltage.org/forum/13760-contactor-or-fuse-position-circuit-rules-why
https://visforvoltage.org/forum/13759-contactor-or-fuse-position-circuit-rules-why
https://visforvoltage.org/forum/12725-repairing-lithium-battery-pack
https://visforvoltage.org/forum/13752-questions-sepex-motor-theory
https://visforvoltage.org/forum/13738-programming-curtis-controller-software
https://visforvoltage.org/forum/13741-making-own-simple-controller
https://visforvoltage.org/forum/12420-idea-charging-electric-car-portably-wo-relying-electricity-infrastructure
2020-07-23 23:17:28 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2020-07-23 23:17:28 [scrapy.core.engine] INFO: Closing spider (finished)
As I see it, the spider didn't iterate over these links and collect the data from them. What could be the reason for that?
I would really appreciate any help. Thank you!
It works for me.
import scrapy, time

class ForumSpiderSpider(scrapy.Spider):
    name = 'forum_spider'
    allowed_domains = ['visforvoltage.org/latest_tech/']
    start_urls = ['http://visforvoltage.org/latest_tech/']

    def parse(self, response):
        for href in response.css(r"tbody a[href*='/forum/']::attr(href)").extract():
            url = response.urljoin(href)
            req = scrapy.Request(url, callback=self.parse_data, dont_filter=True)
            yield req

    def parse_data(self, response):
        for url in response.css('html'):
            data = {}
            data['name'] = url.css(r"div[class='author-pane-line author-name'] span[class='username']::text").extract()
            data['date'] = url.css(r"div[class='forum-posted-on']:contains('-') ::text").extract()
            data['title'] = url.css(r"div[class='section'] h1[class='title']::text").extract()
            data['body'] = url.css(r"div[class='field-items'] p::text").extract()
            yield data
        next_page = response.css(r"li[class='pager-next'] a[href*='page=']::attr(href)").extract()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse)
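For readers comparing this to the original spider (my note, not the answerer's): apart from removing the doubled slash in the start URL, the substantive change is dont_filter=True on each request:

req = scrapy.Request(url, callback=self.parse_data, dont_filter=True)

In stock Scrapy, dont_filter=True exempts a request from both the duplicate filter and the offsite check, which is why this version crawls the /forum/ pages even though allowed_domains is still too narrow — see the next answer for the cleaner fix.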
The issue probably is that the requests are getting filtered, as they are not part of the allowed domain.
allowed_domains = ['visforvoltage.org/latest_tech/']
New requests url:
https://visforvoltage.org/forum/14448-battery-charger-problems
https://visforvoltage.org/forum/14191-vectrix-trickle-charger
...
The requests go to visforvoltage.org/forum/... and not to visforvoltage.org/latest_tech/, so they are dropped.
You can remove the allowed_domains property entirely, or change it to:
allowed_domains = ['visforvoltage.org']
This will make the spider crawl the pages; you will then see a different value in this line of your log:
2020-07-23 23:17:28 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
However the selectors in the parsing don't seem right.
The selector below selects the whole page, and the extract() method returns it as a list, so you end up with a list containing a single string composed of all the HTML of the page:
response.css('html').extract()
You can read more on selectors and the getall()/extract() method here.
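To make that concrete (a sketch that goes beyond the original answer; the class names are the ones from the question's selectors and assume the forum markup hasn't changed): select the leaf values directly and call get()/getall() only on them, and pull next_page out as a single string:

def parse_data(self, response):
    data = {
        # getall() returns a list of strings, get() a single string (or None)
        'name': response.css("div.author-pane-line.author-name span.username::text").getall(),
        'date': response.css("div.forum-posted-on ::text").getall(),
        'title': response.css("div.section h1.title::text").get(),
        'body': response.css("div.field-items p::text").getall(),
    }
    yield data

    # get() yields one href, so urljoin receives a string, not a list
    next_page = response.css("li.pager-next a[href*='page=']::attr(href)").get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)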
My spider starts on this page https://finviz.com/screener.ashx and visits every link in the table to yield some items on the other side. This worked perfectly fine. I then wanted to add another layer of depth by having my spider visit a link on the page it initially visits, like so:
start_urls > url > url_2
The spider is supposed to visit "url", yield some items along the way, then visit "url_2", yield a few more items, and then move on to the next url from start_urls.
Here is my spider code:
import scrapy
from scrapy import Request
from dimstatistics.items import DimstatisticsItem

class StatisticsSpider(scrapy.Spider):
    name = 'statistics'

    def __init__(self):
        self.start_urls = ['https://finviz.com/screener.ashx?v=111&f=ind_stocksonly&r=01']
        npagesscreener = 1000
        for i in range(1, npagesscreener + 1):
            self.start_urls.append("https://finviz.com/screener.ashx?v=111&f=ind_stocksonly&r=" + str(i) + "1")

    def parse(self, response):
        for href in response.xpath("//td[contains(@class, 'screener-body-table-nw')]/a/@href"):
            url = "https://www.finviz.com/" + href.extract()
            yield Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = {}
        item['statisticskey'] = response.xpath("//a[contains(@class, 'fullview-ticker')]//text()").extract()[0]
        item['shares_outstanding'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[9]
        item['shares_float'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[21]
        item['short_float'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[33]
        item['short_ratio'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[45]
        item['institutional_ownership'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[7]
        item['institutional_transactions'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[19]
        item['employees'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[97]
        item['recommendation'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[133]
        yield item

        url2 = response.xpath("//table[contains(@class, 'fullview-links')]//a/@href").extract()[0]
        yield response.follow(url2, callback=self.parse_dir_stats)

    def parse_dir_stats(self, response):
        item = {}
        item['effective_tax_rate_ttm_company'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[2]/text()").extract()
        item['effective_tax_rate_ttm_industry'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[3]/text()").extract()
        item['effective_tax_rate_ttm_sector'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[4]/text()").extract()
        item['effective_tax_rate_5_yr_avg_company'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[2]/text()").extract()
        item['effective_tax_rate_5_yr_avg_industry'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[3]/text()").extract()
        item['effective_tax_rate_5_yr_avg_sector'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[4]/text()").extract()
        yield item
All of the xpaths and links are right; I just can't seem to yield anything at all now. I have a feeling there's an obvious mistake here. This is my first try at a more elaborate spider.
Any help would be greatly appreciated! Thank you!
*** EDIT 2
{'statisticskey': 'AMRB', 'shares_outstanding': '5.97M', 'shares_float':
'5.08M', 'short_float': '0.04%', 'short_ratio': '0.63',
'institutional_ownership': '10.50%', 'institutional_transactions': '2.74%',
'employees': '101', 'recommendation': '2.30'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMR&ty=c&p=d&b=1>
{'statisticskey': 'AMR', 'shares_outstanding': '154.26M', 'shares_float':
'89.29M', 'short_float': '13.99%', 'short_ratio': '4.32',
'institutional_ownership': '0.10%', 'institutional_transactions': '-',
'employees': '-', 'recommendation': '3.00'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMD&ty=c&p=d&b=1>
{'statisticskey': 'AMD', 'shares_outstanding': '1.00B', 'shares_float':
'997.92M', 'short_float': '11.62%', 'short_ratio': '1.27',
'institutional_ownership': '0.70%', 'institutional_transactions': '-83.83%',
'employees': '10100', 'recommendation': '2.50'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMCX&ty=c&p=d&b=1>
{'statisticskey': 'AMCX', 'shares_outstanding': '54.70M', 'shares_float':
'43.56M', 'short_float': '20.94%', 'short_ratio': '14.54',
'institutional_ownership': '3.29%', 'institutional_transactions': '0.00%',
'employees': '1872', 'recommendation': '3.00'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/screener.ashx?v=111&f=geo_bermuda>
{'effective_tax_rate_ttm_company': [], 'effective_tax_rate_ttm_industry':
[], 'effective_tax_rate_ttm_sector': [],
'effective_tax_rate_5_yr_avg_company': [],
'effective_tax_rate_5_yr_avg_industry': [],
'effective_tax_rate_5_yr_avg_sector': []}
2019-03-06 18:45:25 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/screener.ashx?v=111&f=geo_china>
{'effective_tax_rate_ttm_company': [], 'effective_tax_rate_ttm_industry':
[], 'effective_tax_rate_ttm_sector': [],
'effective_tax_rate_5_yr_avg_company': [],
'effective_tax_rate_5_yr_avg_industry': [],
'effective_tax_rate_5_yr_avg_sector': []}
*** EDIT 3
Managed to actually have the spider travel to url2 and yield the items there. The problem is that it only does so rarely. Most of the time it redirects to the correct link and gets nothing, or doesn't seem to redirect at all and continues on. Not really sure why there is such inconsistency here.
2019-03-06 20:11:57 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.reuters.com/finance/stocks/financial-highlights/BCACU.A>
{'effective_tax_rate_ttm_company': ['--'],
'effective_tax_rate_ttm_industry': ['4.63'],
'effective_tax_rate_ttm_sector': ['20.97'],
'effective_tax_rate_5_yr_avg_company': ['--'],
'effective_tax_rate_5_yr_avg_industry': ['3.98'],
'effective_tax_rate_5_yr_avg_sector': ['20.77']}
The other thing is, I know I've managed to yield a few values on url2 successfully, though they don't appear in my CSV output. I realize this could be an export issue. I've updated my code above to its current state.
url2 is a relative path, but scrapy.Request expects a full URL.
Try this:
yield Request(
    response.urljoin(url2),
    callback=self.parse_dir_stats)
Or even simpler:
yield response.follow(url2, callback=self.parse_dir_stats)
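As an illustrative aside (not part of the original answer): urljoin resolves a relative path against the URL of the response being parsed and leaves absolute URLs untouched, so it is safe to apply to either kind of href:

# supposing the current response is https://www.finviz.com/quote.ashx?t=AMD
response.urljoin('screener.ashx?v=111')
# -> 'https://www.finviz.com/screener.ashx?v=111'

# an absolute URL passes through unchanged
response.urljoin('https://www.reuters.com/finance/stocks/financial-highlights/AMD')
# -> 'https://www.reuters.com/finance/stocks/financial-highlights/AMD'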
Started with Python and scraping a week ago, so please go easy on me. parse() yields a request whose callback is parseCatL1(), and parseCatL1() yields a request whose callback is detailCollector(). The request yielded in parseCatL1() is not taking it to detailCollector().
def parse(self, response):
    self.log('url entered is :' + response.url)
    item = BasicItem()
    for mainCat in response.css('ul.mm-listview > li'):
        item['title'] = mainCat.css('a > span.meditle1::text').extract_first()
        item['url'] = mainCat.css('a::attr(href)').extract_first()
        item['url6'] = item['url']
        request = scrapy.Request(item['url'], callback=self.parseCatL1)
        request.meta['item'] = item['url6']
        request.meta['item'] = item
        yield request

def parseCatL1(self, response):
    self.log('l1 url entered is :' + response.url)
    item = response.meta['item']
    item['listing'] = response.css('ul.mm-listview > li')
    if item['listing'] == []:
        print 'l1 if entered'
        item['url6'] = item['url']
        request = scrapy.Request(item['url6'], callback=self.detailCollector)
        request.meta['item'] = item['url6']
        print request
        yield request

def detailCollector(self, response):
    print 'detail collector entered'
print request in parseCatL1() prints <GET https://www.some.com>, and if the request is being yielded, why is it not printing 'detail collector entered'?
Below are the logs:
DEBUG: Crawled (200) <GET https://www.justdial.com/Bangalore/AC-Compressor-Dealers/nct-10002128> (referer: https://www.justdial.com/Bangalore/311/11060105_3/AC-Compressor-Dealers_b2c)
DEBUG: l1 url entered is :https://www.justdial.com/Bangalore/AC-Compressor-Dealers/nct-10002128
l1 if entered
https://www.justdial.com/Bangalore/AC-Compressor-Dealers-Tecumseh/nct-10002136
<GET https://www.justdial.com/Bangalore/AC-Compressor-Dealers-Tecumseh/nct-10002136>
DEBUG: Filtered duplicate request: <GET https://www.justdial.com/Bangalore/AC-Compressor-Dealers-Tecumseh/nct-10002136> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
DEBUG: Crawled (200) <GET https://www.justdial.com/Bangalore/AC-Compressor-Dealers-Daikin/nct-10002130> (referer: https://www.justdial.com/Bangalore/311/11060105_3/AC-Compressor-Dealers_b2c)
DEBUG: l1 url entered is :https://www.justdial.com/Bangalore/AC-Compressor-Dealers-Daikin/nct-10002130
l1 if entered
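No answer is recorded here, but the log itself suggests a likely cause: the "Filtered duplicate request" line means the request yielded for detailCollector targets a URL that was already requested earlier in the crawl (item['url6'] is just item['url'] again), so Scrapy's dupefilter silently drops it and the callback never runs. If re-fetching the same URL is actually intended, dont_filter=True bypasses the filter; a one-line sketch against the question's code:

request = scrapy.Request(item['url6'], callback=self.detailCollector, dont_filter=True)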
I'm trying to make my spider go over a list and scrape all the URLs it can find, following them, scraping some data, and then continuing with the next unscraped link. When I run the spider I can see that it returns to the starting page but tries to scrape the same page again and just quits afterwards. Any code suggestions? I'm pretty new to Python.
import scrapy
import re
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://domain.com/list"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            item['listurl'] = sel.xpath('//a[@id="link101"]/@href').extract()[0]
            request = scrapy.Request(item['listurl'], callback=self.parseBasicListingInfo)
            yield request

    def parseBasicListingInfo(item, response):
        item = ListResidentialItem()
        item['title'] = response.xpath('//span[@class="detail"]/text()').extract()
        return item
To clarify: I'm passing [0] so it only takes the first link of the list, but I want it to continue with the next unscraped link.
Output after running the spider:
2016-07-18 12:11:20 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/robots.txt> (referer: None)
2016-07-18 12:11:20 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/list> (referer: None)
2016-07-18 12:11:21 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/link1> (referer: http://www.domain.com/list)
2016-07-18 12:11:21 [scrapy] DEBUG: Scraped from <200 http://www.domain.com/link1>
{'title': [u'\rlink1\r']}
This should just work fine. Change the domain and xpath and see:
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProdItems(scrapy.Item):
    listurl = scrapy.Field()
    title = scrapy.Field()

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://domain.com/list"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            list_urls = sel.xpath('//a[@id="link101"]/@href').extract()
            for url in list_urls:
                item['listurl'] = url
                yield scrapy.Request(url, callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(self, response):
        item = response.meta['item']
        item['title'] = response.xpath('//span[@class="detail"]/text()').extract()
        yield item
This is the line that's causing your problem:
item['listurl'] = sel.xpath('//a[@id="link101"]/@href').extract()[0]
The "//" means "from the start of the document" which means that it scans from the very first tag and will always find the same first link. What you need to do is search relative to the start of the current tag using ".//" which means "from this tag onwards". Also your current for loop is visiting every tag in the document which is unneccesary. Try this:
def parse(self, response):
    for href in response.xpath('//a[@id="link101"]/@href').extract():
        item = ProductionItem()
        item['listurl'] = href
        yield scrapy.Request(href, callback=self.parseBasicListingInfo, meta={'item': item})
The xpath pulls the hrefs out of the links and returns them as a list you can iterate over.
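One caveat worth adding (my note, not the original answer's): if the hrefs extracted this way are relative paths rather than absolute URLs, scrapy.Request will raise a "Missing scheme in request url" error on them; joining against the response URL first handles both cases:

yield scrapy.Request(
    response.urljoin(href),  # no-op for absolute URLs, resolves relative ones
    callback=self.parseBasicListingInfo,
    meta={'item': item})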