I'm trying to crawl a website and my spider (I don't know why) is crawling my links in such disorder!!
It crawls all the links I want, but it stores only the first set of values (rank and url_seller, as in the example below)... I'm new to this world of crawling, Python and Scrapy, but all I want is to learn!! I'm posting my code here; could somebody help me?
# -*- coding: utf-8 -*-
import scrapy
import re
import numbers
from MarketplacePulse.items import MarketplacepulseItem
import urllib.parse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MarketplacePulseSpider(CrawlSpider):
    name = 'MP_A_UK'
    allowed_domains = ['marketplacepulse.com', 'amazon.co.uk']
    start_urls = ['https://www.marketplacepulse.com/amazon/uk']

    def parse(self, response):
        item = MarketplacepulseItem()
        rank = response.xpath('//div/table/tbody/tr/td[@class="number"]/text()').extract()
        print('\n', rank, '\n')
        url_1 = response.xpath('//div/table/tbody/tr/td/a/@href').extract()
        print('\n', url_1, '\n')
        for i in range(len(rank)-2):
            item['month_rank'] = ''.join(rank[i]).strip()
            item['year_rank'] = ''.join(rank[i+1]).strip()
            item['lifetime_rank'] = ''.join(rank[i+2]).strip()
            i += 3
        for i in range(len(url_1)):
            url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url_1[i])
            yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})

    def parse_2(self, response):
        item = response.meta['item']
        url_2 = response.xpath('//body/div/section/div/div/div/p/a[contains(text(), "Amazon.co.uk")]/@href').extract()
        item['url_seller'] = ''.join(url_2).strip()
        yield scrapy.Request(str(url_2), callback=self.parse_3, meta={'item': item})

    def parse_3(self, response):
        item = response.meta['item']
        business_name = response.xpath('//div[@class="a-row a-spacing-medium"]/div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[.="Business Name:"]/following-sibling::text()').extract()
        phone_number = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[.="Phone number:"]/following-sibling::text()').extract()
        address = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[span[contains(.,"Address:")]]/ul//li//text()').extract()
        item['business_name'] = ''.join(business_name).strip()
        item['phone_number'] = ''.join(phone_number).strip()
        item['address'] = '\n'.join(address).strip()
        yield item
I'm also posting an example of what I want and of what I get... you'll see the problem, I hope!!
What I want:
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A7CL6GT0UVQKS&tab=&vasStoreID=>
{'address': '55740 Currant Rd\nMishawaka\nIndiana\n46545\nUS',
'business_name': 'Better World Books Marketplace Inc',
'lifetime_rank': '863',
'month_rank': '218',
'phone_number': '',
'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
'year_rank': '100'}
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=W5VG5JB9GHYUG&tab=&vasStoreID=>
{'address': 'ROOM 919, BLOCK 2 West, SEG TECHNOLOGY PARK\n'
'SHENZHEN\n'
'GUANGDONG\n'
'518000\n'
'CN\n'
'FU TIAN QU QIAO XIANG LU HAO FENG YUAN 7 DONG 7A\n'
'SHENZHEN\n'
'GUANGDONG\n'
'518000\n'
'CN',
'business_name': 'MUDDER TECHNOLOGY CO., LTD',
'lifetime_rank': '3',
'month_rank': '28',
'phone_number': '86 18565729081',
'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=W5VG5JB9GHYUG&tag=mk4343k-21',
'year_rank': '10'}
And what I get:
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A20T907OQC02JJ&tab=&vasStoreID=>
{'address': '55740 Currant Rd\nMishawaka\nIndiana\n46545\nUS',
'business_name': 'Better World Books Marketplace Inc',
'lifetime_rank': '863',
'month_rank': '218',
'phone_number': '',
'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
'year_rank': '100'}
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A1XG2K8M6HRQZ8&tab=&vasStoreID=>
{'address': 'ROOM 919, BLOCK 2 West, SEG TECHNOLOGY PARK\n'
'SHENZHEN\n'
'GUANGDONG\n'
'518000\n'
'CN\n'
'FU TIAN QU QIAO XIANG LU HAO FENG YUAN 7 DONG 7A\n'
'SHENZHEN\n'
'GUANGDONG\n'
'518000\n'
'CN',
'business_name': 'MUDDER TECHNOLOGY CO., LTD',
'lifetime_rank': '863',
'month_rank': '218',
'phone_number': '86 18565729081',
'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
'year_rank': '100'}
You can see that the url_seller values are exactly the same, and the ranks (by month, year or lifetime) too... but I want them to be different..... And the url_seller is not the same as the link I crawled, but it should be..... Any help please?
Ranking issues
I'll walk through it step by step:
You get a list of rank numbers:
rank = response.xpath('//div/table/tbody/tr/td[@class="number"]/text()').extract()
You get a list of URLs:
url_1 = response.xpath('//div/table/tbody/tr/td/a/@href').extract()
Here's where you go wrong:
for i in range(len(rank)-2):
    item['month_rank'] = ''.join(rank[i]).strip()
    item['year_rank'] = ''.join(rank[i+1]).strip()
    item['lifetime_rank'] = ''.join(rank[i+2]).strip()
    i += 3
Firstly, since you're using a for loop, your i variable gets reset to the next item (in this case the next number) at the start of each iteration, so you're stepping through the ranks one at a time, not by threes. That i += 3 is doing nothing, sorry.
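A quick demonstration in plain Python:

# Reassigning the loop variable has no effect: range() hands out
# the next value at the top of every iteration.
for i in range(6):
    print(i)   # prints 0, 1, 2, 3, 4, 5
    i += 3     # overwritten as soon as the loop comes back around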
Anyway, the purpose of the loop appears to be to build the following structure:
{'month_rank': <rank>, 'year_rank': <rank>, 'lifetime_rank': <rank>}
So..., secondly, each time you run this loop, you overwrite the previous set of values without having done anything with them. Oops.
You then proceed to loop through your list of URLs, passing the last set of rankings your previous loop built, along with each url to your parse_2 function.
for i in range(len(url_1)):
    url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url_1[i])
    yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})
You end up with each call to parse_2 having the same set of ranking data.
To fix this, you should deal with each URL and its associated rankings in the same loop:
for i in range(len(url_1)):
    url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url_1[i])
    item['month_rank'] = ''.join(rank[i*3]).strip()
    item['year_rank'] = ''.join(rank[i*3+1]).strip()
    item['lifetime_rank'] = ''.join(rank[i*3+2]).strip()
    yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})
That should fix your rank issues.
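One caveat worth adding: the loop above still passes the same item instance into every request's meta, so later callbacks can overwrite each other's fields. A safer sketch (my addition, not part of the fix above) creates a fresh item per URL:

for i in range(len(url_1)):
    item = MarketplacepulseItem()  # fresh item per request, not one shared instance
    url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url_1[i])
    item['month_rank'] = ''.join(rank[i*3]).strip()
    item['year_rank'] = ''.join(rank[i*3+1]).strip()
    item['lifetime_rank'] = ''.join(rank[i*3+2]).strip()
    yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})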
url_seller issue
I'm not too sure about the url_seller issue: it seems like it should use the same url for both item['url_seller'] and its call to parse_3, and it seems like it's using the right info to call parse_3, yet it keeps putting the same value into item['url_seller'] over and over again.
I'm kind of going out on a limb here, since if I'm understanding the situation properly both methods should (in the particular case that I think this is) produce equal strings, but the only difference I've noticed so far is that in one place you're using ''.join(url_2).strip() and in the other you're using str(url_2).
Since the part where you're using str(url_2) seems to be working properly where it's being used, perhaps you should try using it in the other spot too:
item['url_seller'] = str(url_2)
Related
I am trying to scrape the PlayStation web store, pulling the title and game link from the main page and the price for each game from the second page. However, when using a callback to parse_page2, all the returned items contain the title and item['link'] value of the most recently scraped item (The Last of Us Remastered).
My code below:
class PsStoreSpider(scrapy.Spider):
    name = 'psstore'
    start_urls = ['https://store.playstation.com/en-ie/pages/browse']

    def parse(self, response):
        item = PlaystationItem()
        products = response.css('a.psw-link')
        for product in products:
            item['main_url'] = response.url
            item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
            item['link'] = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
            link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
            request = Request(link, callback=self.parse_page2)
            request.meta['item'] = item
            yield request

    def parse_page2(self, response):
        item = response.meta['item']
        item['price'] = response.css('span.psw-t-title-m::text').get()
        item['other_url'] = response.url
        yield item
And part of the output:
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/229261>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/229261',
'price': 'Free',
'title': 'The Last of Us™ Remastered'}
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/232847>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/232847',
'price': '€59.99',
'title': 'The Last of Us™ Remastered'}
2022-05-09 19:54:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://store.playstation.com/en-ie/concept/224802> (referer: https://store.playstation.com/en-ie/pages/browse)
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/224802>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/224802',
'price': '€29.99',
'title': 'The Last of Us™ Remastered'}
As you can see the price is correctly returned but title and link are taken from the last scraped object. What am I missing here?
Thanks
The thing is that you create your item at the beginning of your parse method and then update it over and over. That also means that you always pass the same item to parse_page2.
If you were to create your item in the for-loop you would get a new one in every iteration and should get the expected result.
Like this:
def parse(self, response):
    products = response.css('a.psw-link')
    for product in products:
        item = PlaystationItem()
        item['main_url'] = response.url
        item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
        item['link'] = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
        link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
        request = Request(link, callback=self.parse_page2)
        request.meta['item'] = item
        yield request
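As a side note, newer Scrapy versions (1.7+) also support cb_kwargs, which hands the item to the callback as a plain keyword argument instead of going through meta; a minimal sketch of the same hand-off:

# Sketch: pass the item via cb_kwargs (Scrapy 1.7+) instead of request.meta.
request = Request(link, callback=self.parse_page2, cb_kwargs={'item': item})
yield request

def parse_page2(self, response, item):
    item['price'] = response.css('span.psw-t-title-m::text').get()
    item['other_url'] = response.url
    yield item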
I'm web scraping a site, making a spider that crawls through the pagination for the newest videos and scrapes metadata for each of the 32 videos per page.
Next is my code for the spider:
class NaughtySpider(scrapy.Spider):
    name = "naughtyspider"
    allowed_domains = ["example.com"]
    max_pages = 3

    # Start request
    def start_requests(self):
        for i in range(1, self.max_pages):
            yield scrapy.Request('https://www.example.com/video?o=cm&page=%s' % i, callback=self.parse_video)

    # First parsing method
    def parse_video(self, response):
        self.log('Finished scraping ' + response.url)
        video_links = response.css('ul#videoCategory').css('li.videoBox').css('div.thumbnail-info-wrapper').css('span.title > a').css('::attr(href)')  # Correct path; selects the 32 videos on the page, ignoring links coming from ads
        links_to_follow = video_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_metadata)

    # Second parsing method
    def parse_metadata(self, response):
        # Create a SelectorList of the course titles text
        video_title = response.css('div.title-container > h1.title > span.inlineFree::text')
        # Extract the text and strip it clean
        video_title_ext = video_title.extract_first()
        # Extract views
        video_views = response.css('span.count::text').extract_first()
        # Extract tags
        video_tags = response.css('div.tagsWrapper a::text').extract()
        # Extract Categories
        video_categories = response.css('div.categoriesWrapper a::text').extract()
        # Fill in the dictionary
        yield {
            'title': video_title_ext,
            'views': video_views,
            'tags': video_tags,
            'categories': video_categories,
        }
The thing is that almost half the entries end up being empty, with no title, views, tags or categories. Example of the log:
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/view_video.php?viewkey=ph5d594b093f8d6>
{'title': None, 'views': None, 'tags': [], 'categories': []}
But at the same time, if I fetch the very same link in the scrapy shell, and copy and paste the very same selector paths in the spider, it gives me the correct values:
In [4]: fetch('https://www.example.com/view_video.php?viewkey=ph5d594b093f8d6')
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/view_video.php?viewkey=ph5d594b093f8d6> (referer: None)
In [5]: response.css('div.tagsWrapper a::text').extract()
Out[5]: ['alday', '559', '+ ']
In [6]: response.css('span.count::text').extract_first()
Out[6]: '6'
Thanks in advance for any help.
Edit: Would I be correct in thinking that this is not a problem with my code but rather a limitation on the server to avoid being scraped?
Data such as the views, duration, etc. seem to be filled in via an HTML <var> element (<var> DATA </var>). For example, if you enter the following line in your scrapy shell, you should get the duration.
response.xpath(".//var[#class='duration')")
Not sure if it will work, but it's worth a try.
P.S. I'll have to tell my wife it was for educational purposes...
My spider starts on this page https://finviz.com/screener.ashx and visits every link in the table to yield some items on the other side. This worked perfectly fine. I then wanted to add another layer of depth by having my spider visit a link on the page it initially visits, like so:
start_urls > url > url_2
The spider is supposed to visit "url", yield some items along the way, then visit "url_2" and yield a few more items and then move on to the next url from the start_url.
Here is my spider code:
import scrapy
from scrapy import Request
from dimstatistics.items import DimstatisticsItem
class StatisticsSpider(scrapy.Spider):
    name = 'statistics'

    def __init__(self):
        self.start_urls = ['https://finviz.com/screener.ashx?v=111&f=ind_stocksonly&r=01']
        npagesscreener = 1000
        for i in range(1, npagesscreener + 1):
            self.start_urls.append("https://finviz.com/screener.ashx?v=111&f=ind_stocksonly&r=" + str(i) + "1")

    def parse(self, response):
        for href in response.xpath("//td[contains(@class, 'screener-body-table-nw')]/a/@href"):
            url = "https://www.finviz.com/" + href.extract()
            yield Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = {}
        item['statisticskey'] = response.xpath("//a[contains(@class, 'fullview-ticker')]//text()").extract()[0]
        item['shares_outstanding'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[9]
        item['shares_float'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[21]
        item['short_float'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[33]
        item['short_ratio'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[45]
        item['institutional_ownership'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[7]
        item['institutional_transactions'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[19]
        item['employees'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[97]
        item['recommendation'] = response.xpath("//table[contains(@class, 'snapshot-table2')]/tr/td/descendant::text()").extract()[133]
        yield item
        url2 = response.xpath("//table[contains(@class, 'fullview-links')]//a/@href").extract()[0]
        yield response.follow(url2, callback=self.parse_dir_stats)

    def parse_dir_stats(self, response):
        item = {}
        item['effective_tax_rate_ttm_company'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[2]/text()").extract()
        item['effective_tax_rate_ttm_industry'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[3]/text()").extract()
        item['effective_tax_rate_ttm_sector'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate (TTM)']]/td[4]/text()").extract()
        item['effective_tax_rate_5_yr_avg_company'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[2]/text()").extract()
        item['effective_tax_rate_5_yr_avg_industry'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[3]/text()").extract()
        item['effective_tax_rate_5_yr_avg_sector'] = response.xpath("//tr[td[normalize-space()='Effective Tax Rate - 5 Yr. Avg.']]/td[4]/text()").extract()
        yield item
All of the xpaths and links are right, I just can't seem to yield anything at all now. I have a feeling there is an obvious mistake here. My first try at a more elaborate spider.
Any help would be greatly appreciated! Thank you!
***EDIT 2
{'statisticskey': 'AMRB', 'shares_outstanding': '5.97M', 'shares_float':
'5.08M', 'short_float': '0.04%', 'short_ratio': '0.63',
'institutional_ownership': '10.50%', 'institutional_transactions': '2.74%',
'employees': '101', 'recommendation': '2.30'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMR&ty=c&p=d&b=1>
{'statisticskey': 'AMR', 'shares_outstanding': '154.26M', 'shares_float':
'89.29M', 'short_float': '13.99%', 'short_ratio': '4.32',
'institutional_ownership': '0.10%', 'institutional_transactions': '-',
'employees': '-', 'recommendation': '3.00'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMD&ty=c&p=d&b=1>
{'statisticskey': 'AMD', 'shares_outstanding': '1.00B', 'shares_float':
'997.92M', 'short_float': '11.62%', 'short_ratio': '1.27',
'institutional_ownership': '0.70%', 'institutional_transactions': '-83.83%',
'employees': '10100', 'recommendation': '2.50'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/quote.ashx?t=AMCX&ty=c&p=d&b=1>
{'statisticskey': 'AMCX', 'shares_outstanding': '54.70M', 'shares_float':
'43.56M', 'short_float': '20.94%', 'short_ratio': '14.54',
'institutional_ownership': '3.29%', 'institutional_transactions': '0.00%',
'employees': '1872', 'recommendation': '3.00'}
2019-03-06 18:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/screener.ashx?v=111&f=geo_bermuda>
{'effective_tax_rate_ttm_company': [], 'effective_tax_rate_ttm_industry':
[], 'effective_tax_rate_ttm_sector': [],
'effective_tax_rate_5_yr_avg_company': [],
'effective_tax_rate_5_yr_avg_industry': [],
'effective_tax_rate_5_yr_avg_sector': []}
2019-03-06 18:45:25 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.finviz.com/screener.ashx?v=111&f=geo_china>
{'effective_tax_rate_ttm_company': [], 'effective_tax_rate_ttm_industry':
[], 'effective_tax_rate_ttm_sector': [],
'effective_tax_rate_5_yr_avg_company': [],
'effective_tax_rate_5_yr_avg_industry': [],
'effective_tax_rate_5_yr_avg_sector': []}
*** EDIT 3
I managed to actually get the spider to travel to url2 and yield the items there. The problem is it only does so rarely. Most of the time it redirects to the correct link and gets nothing, or doesn't seem to redirect at all and continues on. Not really sure why there is such inconsistency here.
2019-03-06 20:11:57 [scrapy.core.scraper] DEBUG: Scraped from <200
https://www.reuters.com/finance/stocks/financial-highlights/BCACU.A>
{'effective_tax_rate_ttm_company': ['--'],
'effective_tax_rate_ttm_industry': ['4.63'],
'effective_tax_rate_ttm_sector': ['20.97'],
'effective_tax_rate_5_yr_avg_company': ['--'],
'effective_tax_rate_5_yr_avg_industry': ['3.98'],
'effective_tax_rate_5_yr_avg_sector': ['20.77']}
The other thing is, I know I've managed to yield a few values on url2 successfully, though they don't appear in my CSV output. I realize this could be an export issue. I've updated my code above to how it currently stands.
url2 is a relative path, but scrapy.Request expects a full URL.
Try this:
yield Request(
    response.urljoin(url2),
    callback=self.parse_dir_stats)
Or even simpler, since response.follow accepts relative URLs directly:
yield response.follow(url2, callback=self.parse_dir_stats)
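For reference, response.urljoin essentially applies the standard-library join against response.url, so the behavior is easy to check outside Scrapy (the values below are made up for illustration):

from urllib.parse import urljoin

base = 'https://www.finviz.com/quote.ashx?t=AMD&ty=c&p=d&b=1'  # hypothetical page URL
print(urljoin(base, 'screener.ashx?v=111'))
# -> 'https://www.finviz.com/screener.ashx?v=111'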
My code is below. I'm looking to extract the results to CSV. However, Scrapy yields a dictionary with 2 keys, with all the values lumped together in each key, and the output does not look good.
How do I fix this? Can this be done through pipelines/item loaders etc...?
Thanks very much.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join
from gumtree1.items import GumtreeItems
class AdItemLoader(ItemLoader):
    jobs_in = MapCompose(unicode.strip)

class GumtreeEasySpider(CrawlSpider):
    name = 'gumtree_easy'
    allowed_domains = ['gumtree.com.au']
    start_urls = ['http://www.gumtree.com.au/s-jobs/page-2/c9302?ad=offering']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="rs-paginator-btn next"]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        loader = AdItemLoader(item=GumtreeItems(), response=response)
        loader.add_xpath('jobs', '//div[@id="recent-sr-title"]/following-sibling::*//*[@itemprop="name"]/text()')
        loader.add_xpath('location', '//div[@id="recent-sr-title"]/following-sibling::*//*[@class="rs-ad-location-area"]/text()')
        yield loader.load_item()
The result:
2016-03-16 01:51:32 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-5/c9302?ad=offering>
{'jobs': [u'Technical Account Manager',
u'Service & Maintenance Advisor',
u'we are hiring motorbike driver delivery leaflet.Strat NOW(BE...',
u'Casual Gardner/landscape maintenance labourer',
u'Seeking for Experienced Builders Cleaners with white card',
u'Babysitter / home help for approx 2 weeks',
u'Toothing brickwork | Dapto',
u'EXPERIENCED CHEF',
u'ChildCare Trainee Wanted',
u'Skilled Pipelayers & Drainer- Sydney Region',
u'Casual staff required for Royal Easter Show',
u'Fencing contractor',
u'Excavator & Loader Operator',
u'***EXPERIENCED STRAWBERRY AND RASPBERRY PICKERS WANTED***',
u'Kitchenhand required for Indian restaurant',
u'Taxi Driver Wanted',
u'Full time nanny/sitter',
u'Kitchen hand and meal packing',
u'Depot Assistant Required',
u'hairdresser Junior apprentice required for salon in Randwick',
u'Insulation Installers Required',
u'The Knox is seeking a new apprentice',
u'Medical Receptionist Needed in Bankstown Area - Night Shifts',
u'On Call Easy Work, Do you live in Berala, Lidcombe or Auburn...',
u'Looking for farm jon'],
'location': [u'Melbourne City',
u'Eastern Suburbs',
u'Rockdale Area',
u'Logan Area',
u'Greater Dandenong',
u'Brisbane North East',
u'Kiama Area',
u'Byron Area',
u'Dardanup Area',
u'Blacktown Area',
u'Auburn Area',
u'Kingston Area',
u'Inner Sydney',
u'Northern Midlands',
u'Inner Sydney',
u'Hume Area',
u'Maribyrnong Area',
u'Perth City',
u'Brisbane South East',
u'Eastern Suburbs',
u'Gold Coast South',
u'North Canberra',
u'Bankstown Area',
u'Auburn Area',
u'Gingin Area']}
Should it be like this instead, with jobs and location as individual dicts? This writes correctly to CSV, with jobs and location in separate cells, but I suspect that using for loops and zip is not the best way.
import scrapy
from gumtree1.items import GumtreeItems

class AussieGum1Spider(scrapy.Spider):
    name = "aussie_gum1"
    allowed_domains = ["gumtree.com.au"]
    start_urls = (
        'http://www.gumtree.com.au/s-jobs/page-2/c9302?ad=offering',
    )

    def parse(self, response):
        item = GumtreeItems()
        jobs = response.xpath('//div[@id="recent-sr-title"]/following-sibling::*//*[@itemprop="name"]/text()').extract()
        location = response.xpath('//div[@id="recent-sr-title"]/following-sibling::*//*[@class="rs-ad-location-area"]/text()').extract()
        for j, l in zip(jobs, location):
            item['jobs'] = j.strip()
            item['location'] = l
            yield item
Partial results below.
2016-03-16 02:20:46 [scrapy] DEBUG: Crawled (200) <GET http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering> (referer: http://www.gumtree.com.au/s-jobs/page-2/c9302?ad=offering)
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'Live In Au pair-Urgent', 'location': u'Wanneroo Area'}
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'live in carer', 'location': u'Fraser Coast'}
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'Mental Health Nurse', 'location': u'Perth Region'}
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'Experienced NBN pit and pipe installers/node and cabinet wor...',
'location': u'Marrickville Area'}
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'Delivery Driver / Pizza Maker Job - Dominos Pizza',
'location': u'Hurstville Area'}
Thanks very much.
To be honest, using a for loop is the right way, but you can work around it in a pipeline:
from scrapy.http import Response
from gumtree1.items import GumtreeItems, CustomItem
from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        if isinstance(item, GumtreeItems):
            for i, jobs in enumerate(item['jobs']):
                self.crawler.engine.scraper._process_spidermw_output(
                    CustomItem(jobs=jobs, location=item['location'][i]), None, Response(''), spider)
            raise DropItem("main item dropped")
        return item
also add the custom item:
class CustomItem(scrapy.Item):
    jobs = scrapy.Field()
    location = scrapy.Field()
Hope this helped, again I think you should use the loop.
Have a parent selector for every item and extract job and location relative to it:
rows = response.xpath('//div[@id="recent-sr-title"]/following-sibling::*')
for row in rows:
    item = GumtreeItems()
    item['jobs'] = row.xpath('.//*[@itemprop="name"]/text()').extract_first().strip()
    item['location'] = row.xpath('.//*[@class="rs-ad-location-area"]/text()').extract_first().strip()
    yield item
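If some rows can lack a name or location, extract_first() returns None and the .strip() calls would raise an AttributeError; a defensive sketch (my assumption about the page, not part of the answer above):

for row in rows:
    item = GumtreeItems()
    # default='' keeps .strip() safe when a row has no match
    item['jobs'] = row.xpath('.//*[@itemprop="name"]/text()').extract_first(default='').strip()
    item['location'] = row.xpath('.//*[@class="rs-ad-location-area"]/text()').extract_first(default='').strip()
    yield item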
As a part of learning to use Scrapy, I have tried to crawl Amazon, and there is a problem while scraping data.
The output of my code is as follows:
2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
{'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13',
u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate/dp/0062065351/ref=sr_1_14?s=books&ie=UTF8&qid=1361774694&sr=1-14',
u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15',
u'http://www.amazon.com/Inferno-Robert-Langdon-Dan-Brown/dp/0385537859/ref=sr_1_16?s=books&ie=UTF8&qid=1361774694&sr=1-16',
u'http://www.amazon.com/Memory-Light-Wheel-Time/dp/0765325950/ref=sr_1_17?s=books&ie=UTF8&qid=1361774694&sr=1-17',
u'http://www.amazon.com/Jesus-Calling-Enjoying-Peace-Presence/dp/1591451884/ref=sr_1_18?s=books&ie=UTF8&qid=1361774694&sr=1-18',
u'http://www.amazon.com/Fifty-Shades-Grey-Book-Trilogy/dp/0345803485/ref=sr_1_19?s=books&ie=UTF8&qid=1361774694&sr=1-19',
u'http://www.amazon.com/Fifty-Shades-Trilogy-Darker-3-/dp/034580404X/ref=sr_1_20?s=books&ie=UTF8&qid=1361774694&sr=1-20',
u'http://www.amazon.com/Wheat-Belly-Lose-Weight-Health/dp/1609611543/ref=sr_1_21?s=books&ie=UTF8&qid=1361774694&sr=1-21',
u'http://www.amazon.com/Publication-Manual-American-Psychological-Association/dp/1433805618/ref=sr_1_22?s=books&ie=UTF8&qid=1361774694&sr=1-22',
u'http://www.amazon.com/One-Only-Ivan-Katherine-Applegate/dp/0061992259/ref=sr_1_23?s=books&ie=UTF8&qid=1361774694&sr=1-23',
u'http://www.amazon.com/Inquebrantable-Spanish-Jenni-Rivera/dp/1476745420/ref=sr_1_24?s=books&ie=UTF8&qid=1361774694&sr=1-24'],
'title': [u'ObamaCare Survival Guide',
u'The Official SAT Study Guide, 2nd edition',
u'Inferno: A Novel (Robert Langdon)',
u'A Memory of Light (Wheel of Time)',
u'Jesus Calling: Enjoying Peace in His Presence',
u'Fifty Shades of Grey: Book One of the Fifty Shades Trilogy',
u'Fifty Shades Trilogy: Fifty Shades of Grey, Fifty Shades Darker, Fifty Shades Freed 3-volume Boxed Set',
u'Wheat Belly: Lose the Wheat, Lose the Weight, and Find Your Path Back to Health',
u'Publication Manual of the American Psychological Association, 6th Edition',
u'The One and Only Ivan',
u'Inquebrantable (Spanish Edition)'],
'visit_id': '2f4d045a9d6013ef4a7cbc6ed62dc111f6111633',
'visit_status': 'new'}
But I wanted the output to be captured like this:
2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
{'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13'],
'title': [u'ObamaCare Survival Guide']}
2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
{'link': [u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15'],
'title': [u'The Official SAT Study Guide, 2nd edition']}
I think it's not a problem with Scrapy or the crawler, but with the for loop I've written.
Following is the code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Amaze.items import AmazeItem

class AmazeSpider2(CrawlSpider):
    name = "scanon"
    allowed_domains = ["www.amazon.com"]
    start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*")), callback="parse_items_1", follow=True),
    )

    def parse_items_1(self, response):
        items = []
        print('*** response:', response.url)
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//h3')
        for title in titles:
            item = AmazeItem()
            item["title"] = title.select('//a[@class="title"]/text()').extract()
            item["link"] = title.select('//a[@class="title"]/@href').extract()
            print('**parse-items_1:', item["title"], item["link"])
            items.append(item)
        return items
Any assistance would be appreciated!
The problem is in your XPath:
def parse_items_1(self, response):
    items = []
    print('*** response:', response.url)
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')
    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()
        print('**parse-items_1:', item["title"], item["link"])
        items.append(item)
    return items
In the XPaths above you need to use . to make the XPath look inside title only; otherwise your XPath will look at the whole page, get a lot of matches, and return all of them.
By the way, you can test out your XPath expressions in the Scrapy shell: http://doc.scrapy.org/en/latest/topics/shell.html
Done right, it will save you hours of work and a headache. :)
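For instance, something like this (a sketch using the question's start URL; depending on your Scrapy version the shell shortcut is hxs, as in the code above, or sel/response in later releases):

$ scrapy shell "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books"
>>> hxs.select('//h3//a[@class="title"]/text()').extract()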
Use yield to make a generator and fix your xpath selectors:
def parse_items_1(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')
    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()
        yield item