I'm doing a small scrape project and everything is working fine, but I'm having a problem with the order of links, since Scrapy is asynchronous. The value of rankings["Men's Pound-for-Pound"] is a list of links which I expect to be followed in order, so that the output is in order as well.
Here's my code:
import scrapy
from scrapy import Selector


class FighterSpiderSpider(scrapy.Spider):
    name = 'fighter_spider'
    allowed_domains = ['www.ufc.com.br']
    start_urls = ['https://www.ufc.com.br/rankings']

    def parse(self, response):
        all_rankings = response.css('div.view-grouping').getall()  # --> list of all rankings
        champions = {Selector(text=x).css('div.view-grouping div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').get() for x in all_rankings}
        rankings = {Selector(text=x).css('div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').getall() for x in all_rankings}
        if self.ranking == "p4p male":
            for link in rankings["Men's Pound-for-Pound"]:
                yield response.follow(link, callback=self.parse_date)
Scrapy is asynchronous, so there is no way to guarantee that the responses/output will be processed in a specific order. You can manually set a priority for each request, which influences the order in which requests are dispatched by the engine, but it does not guarantee that the responses will be processed in that same order.
You can set the priority by simply passing the priority parameter to your Request or response.follow calls:
for i, link in enumerate(rankings["Men's Pound-for-Pound"]):
    yield response.follow(link, callback=self.parse_date, priority=len(rankings["Men's Pound-for-Pound"]) - i)
The higher the value, the sooner the request is dispatched.
Since even this doesn't guarantee the output ordering, I would suggest simply passing the rank as a callback keyword argument with the request and then sorting the output in an item pipeline or a post-processing step.
For example:
class FighterSpiderSpider(scrapy.Spider):
    name = 'fighter_spider'
    allowed_domains = ['www.ufc.com.br']
    start_urls = ['https://www.ufc.com.br/rankings']

    def parse(self, response):
        all_rankings = response.css('div.view-grouping').getall()  # --> list of all rankings
        champions = {Selector(text=x).css('div.view-grouping div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').get() for x in all_rankings}
        rankings = {Selector(text=x).css('div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').getall() for x in all_rankings}
        if self.ranking == "p4p male":
            for i, link in enumerate(rankings["Men's Pound-for-Pound"]):
                yield response.follow(link, callback=self.parse_date, cb_kwargs={"rank": i + 1})

    def parse_date(self, response, rank):
        ...
        yield {'rank': rank, ...}
Then you can sort the output into the correct order in an item pipeline or in post-processing.
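For example, a minimal item pipeline sketch could collect the items and write them out sorted by rank when the spider closes (the pipeline class name and the output file name here are my own placeholders; it only assumes each item carries the 'rank' field shown above):
import json

class RankOrderPipeline:
    """Collect items as they arrive and dump them sorted by 'rank' at the end."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)  # arrival order doesn't matter here
        return item

    def close_spider(self, spider):
        # sort by the rank attached via cb_kwargs, then write to a file
        self.items.sort(key=lambda i: i["rank"])
        with open("ranked_output.json", "w") as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)
Remember to enable the pipeline via ITEM_PIPELINES in settings.py.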
Related
I'm trying to make a spider that gets some outdated urls from a database, parses them and updates the data in the database. I need to get the urls to scrape and the ids to use in the pipeline that saves the scraped data.
I made this code, but I don't know why Scrapy changes the order of the scraped links; it looks random, so my code assigns the ids wrongly. How can I assign an id to every link?
def start_requests(self):
    urls = self.get_urls_from_database()
    # urls looks like [('link1', 1), ('link2', 2), ('link3', 3)]
    for url in urls:
        # url ('link1', 1)
        self.links_ids.append(url[1])
        yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)

def get_urls_from_database(self):
    self.create_connection()
    self.dbcursor.execute("""SELECT link, id FROM urls_table""")
    urls = self.dbcursor.fetchall()
    return urls

def parse(self, response):
    item = ScrapyItem()
    link_id = self.links_ids[0]
    self.links_ids.remove(link_id)
    ...
    item['name'] = name
    item['price'] = price
    item['price_currency'] = price_currency
    item['link_id'] = link_id
    yield item
Because the links are not processed in order, the output is assigned to the wrong item in the database: the name of item 1 is saved as the name of item 3, the price of item 8 becomes the price of item 1, etc.
async
Scrapy schedules GET requests asynchronously. Your code does not deal gracefully with that.
naming
What you get from the DB is not urls, but rather rows or pairs. Rather than writing
for url in urls:
and using [0] or [1] subscripts, it would be more pythonic to unpack the two items:
for url, id in pairs:
url → id
You attempt to recover an ID in this way:
link_id = self.links_ids[0]
Consider storing the DB results in a dict rather than a list:
for url, id in pairs:
    self.url_to_id[url] = id
Then later you can just look up the required ID with link_id = self.url_to_id[url].
iterating
Ok, let's see what was happening in this loop:
for url in urls:
    self.links_ids.append(url[1])
    yield scrapy.Request(url=url[0], callback=self.parse, dont_filter=True)
In the parse callback triggered by each of those requests, you then run this line:
self.links_ids.remove(link_id)
It appears you're treating the list as if it were a scalar variable: take the first element, then remove it. That would only be safe if Scrapy processed responses synchronously and in the order they were enqueued. That is an odd usage; using e.g. the dict I suggested would probably make you happier.
Furthermore, your code assumes callbacks will happen in the sequence they were enqueued; this is not the case. A dict would sort out that difficulty for you.
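Putting the pieces together, a minimal sketch of the dict-based approach might look like this (it assumes the response URL matches the requested URL, i.e. no redirects; if that is a concern, passing the id along with the request via cb_kwargs avoids the lookup entirely):
def start_requests(self):
    self.url_to_id = {}
    for url, id in self.get_urls_from_database():
        self.url_to_id[url] = id  # remember which DB id belongs to which url
        yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

def parse(self, response):
    item = ScrapyItem()
    # look the id up by url instead of relying on callback order
    item['link_id'] = self.url_to_id[response.url]
    # ... extract name, price, price_currency as before ...
    yield item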
This piece of code is expected to add each extracted reviewId into a set (in order to omit duplicates). Then there is a check: when the set length reaches 100, a callback is executed and a long url string with all the ids is passed to the main extract function.
How do I do this (save all ids extracted from different callbacks into the same set and use it further), either with built-in tools or with the code I have? The problem now is that the length-check branch is never entered.
Update: I believe there are two options: pass the set as meta to each callback, or somehow use an Item for this. But I don't know how.
import scrapy
from scrapy.shell import inspect_response


class QuotesSpider(scrapy.Spider):
    name = "tripad"
    list = set()

    def start_requests(self):
        url = "https://www.tripadvisor.com/Hotel_Review-g60763-d122005-Reviews-or{}-The_New_Yorker_A_Wyndham_Hotel-New_York_City_New_York.html#REVIEWS"
        for i in range(0, 500, 5):
            yield scrapy.Request(url=url.format(i), callback=self.parse)

    def parse(self, response):
        for result in response.xpath('//div[contains(@id, "review_")]/@id').extract():
            if "review" in result[:8]:
                QuotesSpider.list.add(result[7:] + "%2C")
        if len(QuotesSpider.list) == 100:
            url = "https://www.tripadvisor.com/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS&metaReferer=Hotel_Review&reviews="
            for i in QuotesSpider.list:
                url += i
            yield scrapy.Request(url=url, callback=self.parse_page)
There are several ways of doing this; however, I'd advise splitting your spider into two parts:
A spider that collects review ids:
from scrapy import Spider


class CollectorSpider(Spider):
    name = 'collect_reviews'

    def parse(self, response):
        review_ids = ...
        for review_id in review_ids:
            yield {'review_id': review_id}
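The review_ids placeholder could be filled in using the XPath from your own spider, for example (a sketch that keeps your original logic of stripping the "review_" prefix):
def parse(self, response):
    # grab every id attribute that starts with "review_" and keep only the numeric part
    review_ids = [
        rid[len("review_"):]
        for rid in response.xpath('//div[starts-with(@id, "review_")]/@id').extract()
    ]
    for review_id in review_ids:
        yield {'review_id': review_id}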
A spider that uses the collected review ids to fetch the review content:
import json

from scrapy import Request, Spider


class ConsumerSpider(Spider):
    name = 'consume_reviews'

    def start_requests(self):
        with open(self.file, 'r') as f:
            data = json.loads(f.read())
        for i in range(0, len(data), 100):
            chunk = data[i:i + 100]  # take the next batch of up to 100 ids
            ids = [d['review_id'] for d in chunk]
            # make url from ids
            url = ''
            yield Request(url)

    def parse(self, response):
        # crawl 100 reviews here
        pass
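To fill in the '# make url from ids' placeholder, you could reuse the URL pattern from your original spider, roughly like this (the base URL and the '%2C' url-encoded comma separator are taken from your own code and may need adjusting):
BASE = ("https://www.tripadvisor.com/OverlayWidgetAjax"
        "?Mode=EXPANDED_HOTEL_REVIEWS&metaReferer=Hotel_Review&reviews=")

def make_url(review_ids):
    # join one batch of collected ids into a single overlay-widget url
    return BASE + "%2C".join(review_ids)
In start_requests you would then yield Request(make_url(ids), callback=self.parse) for each batch of 100 ids.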
I am trying to write some code to scrape the website of a UK housebuilder and record a list of houses for sale.
I am starting on the page http://www.persimmonhomes.com/sitemap and I have written one part of the code to list all the urls of the housebuilder's developments, and a second part to scrape prices etc. from each of those urls.
I know the second part works and I know that the first part lists out all the urls. But for some reason the urls listed by the first part don't seem to trigger the second part of the code to scrape them.
The code of this first part is:
def parse(self, response):
    for href in response.xpath('//*[@class="contacts-item"]/ul/li/a/@href'):
        url = urlparse.urljoin('http://www.persimmonhomes.com/', href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)
Now, I know this lists the urls I want (if I put in the line print url, they all get listed) and I could manually add them to the code to run the second part if I wanted to. However, even though the urls are created, they do not seem to trigger the second part of the code to scrape them.
The entire code is below:
import scrapy
import urlparse

from Persimmon.items import PersimmonItem


class persimmonSpider(scrapy.Spider):
    name = "persimmon"
    allowed_domains = ["http://www.persimmonhomes.com/"]
    start_urls = [
        "http://www.persimmonhomes.com/sitemap",
    ]

    def parse(self, response):
        for href in response.xpath('//*[@class="contacts-item"]/ul/li/a/@href'):
            url = urlparse.urljoin('http://www.persimmonhomes.com/', href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = PersimmonItem()
            item['name'] = sel.xpath('//*[@id="aspnetForm"]/div[4]/div[1]/div[1]/div/div[2]/span/text()').extract()
            item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
            plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
            plotnames = [plotname.strip() for plotname in plotnames]
            plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
            plotids = [plotid.strip() for plotid in plotids]
            plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
            plotprices = [plotprice.strip() for plotprice in plotprices]
            result = zip(plotnames, plotids, plotprices)
            for plotname, plotid, plotprice in result:
                item['plotname'] = plotname
                item['plotid'] = plotid
                item['plotprice'] = plotprice
                yield item
any views as to why the first part of the code creates the urls but the second part does not loop through them?
You just need to fix your allowed_domains property:
allowed_domains = ["persimmonhomes.com"]
allowed_domains should contain bare domain names, not full URLs with a scheme and trailing slash; with the original value, Scrapy's offsite filtering doesn't recognise the followed urls as belonging to an allowed domain, so the requests to parse_dir_contents are dropped.
(Tested - worked for me.)
I am working on a class project and trying to get all IMDB movie data (titles, budgets, etc.) up until 2016. I adapted the code from https://github.com/alexwhb/IMDB-spider/blob/master/tutorial/spiders/spider.py.
My thought is: for each i in range(1874, 2016) (since 1874 is the earliest year shown on http://www.imdb.com/year/), direct the program to the corresponding year's website and grab the data from that url.
But the problem is, each page for a given year only shows 50 movies, so after crawling those 50 movies, how can I move on to the next page? And after crawling a whole year, how can I move on to the next year? This is my code for the url-parsing part so far, but it can only crawl 50 movies for a particular year.
class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = ["http://www.imdb.com/search/title?year=2014,2014&title_type=feature&sort=moviemeter,asc"]

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
You can use a CrawlSpider to simplify your task. As you'll see below, start_requests dynamically generates the list of URLs, while parse_page only extracts the movies to crawl. Finding and following the 'Next' link is done by the rules attribute.
I agree with @Padraic Cunningham that hard-coding values is not a great idea. I've added spider arguments so that you can call:
scrapy crawl imdb -a start=1950 -a end=1980 (the scraper will default to 1874-2016 if it doesn't get any arguments).
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from imdbyear.items import MovieItem


class IMDBSpider(CrawlSpider):
    name = 'imdb'
    rules = (
        # extract links at the bottom of the page. note that there are 'Prev' and 'Next'
        # links, so a bit of additional filtering is needed
        Rule(LinkExtractor(restrict_xpaths=('//*[@id="right"]/span/a')),
             process_links=lambda links: filter(lambda l: 'Next' in l.text, links),
             callback='parse_page',
             follow=True),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2016

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year + 1):
            yield scrapy.Request('http://www.imdb.com/search/title?year=%d,%d&title_type=feature&sort=moviemeter,asc' % (year, year))

    def parse_page(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            # note -- you had 'MianPageUrl' as your scrapy field name. I would recommend fixing this typo
            # (you will need to change it in items.py as well)
            item['MainPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MainPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request

    # make sure that the dynamically generated start_urls are parsed as well
    parse_start_url = parse_page

    # do your magic
    def parseMovieDetails(self, response):
        pass
You can use the piece of code below to follow the next page:
# 'a.lister-page-next.next-page::attr(href)' is the selector to get the next page link
next_page = response.css('a.lister-page-next.next-page::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)  # joins current and next page url
    yield scrapy.Request(next_page, callback=self.parse)  # calls parse again for the next page
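In context, that snippet would sit at the end of your parse method, after the items for the current page have been yielded; a rough sketch (not tested against the current IMDb markup):
def parse(self, response):
    # ... extract the 50 movies on this page as before ...

    # then follow the 'Next' link, if any, with the same callback
    next_page = response.css('a.lister-page-next.next-page::attr(href)').extract_first()
    if next_page is not None:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)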
I figured out a very dumb way to solve this: I put all the links in start_urls. A better solution would be very much appreciated!
class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = []
    for i in xrange(1874, 2017):
        for j in xrange(1, 11501, 50):
            # since the largest number of movies for a year to have is 11,400 (2016)
            start_url = "http://www.imdb.com/search/title?sort=moviemeter,asc&start=" + str(j) + "&title_type=feature&year=" + str(i) + "," + str(i)
            start_urls.append(start_url)

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
The code that @Greg Sadetsky has provided needs one minor change, in the first line of the parse_page method.
Just change the xpath in the for loop from:
response.xpath("//*[@class='results']/tr/td[3]"):
to
response.xpath("//*[contains(@class,'lister-item-content')]/h3"):
This worked like a charm for me!
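For completeness, the start of parse_page would then look roughly like this (a sketch; it keeps the field names from the earlier answer and assumes the a/text() and a/@href paths still apply under the h3 element):
def parse_page(self, response):
    for sel in response.xpath("//*[contains(@class,'lister-item-content')]/h3"):
        item = MovieItem()
        item['Title'] = sel.xpath('a/text()').extract_first()
        # the href is relative in this layout, so build an absolute url
        item['MainPageUrl'] = response.urljoin(sel.xpath('a/@href').extract_first())
        request = scrapy.Request(item['MainPageUrl'], callback=self.parseMovieDetails)
        request.meta['item'] = item
        yield request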
I'm writing a spider (CrawlSpider) for an online store. According to the client's requirements, I need to write two rules: one for determining which pages have items and another for extracting the items.
I already have both rules working independently:
If my start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] and I comment out the Rule and the code of parse_category, my parse_item will extract every item.
On the other hand, if start_urls = "http://www.example.com" and I comment out the Rule and the code of parse_item, parse_category will return every link that contains items for extracting, i.e. parse_category will return www.example.com/books.php and www.example.com/movies.php.
My problem is that I don't know how to merge the two parts, so that start_urls = "http://www.example.com", parse_category extracts www.example.com/books.php and www.example.com/movies.php, and those links are fed to parse_item, where I actually extract the info for each item.
I need to do it this way instead of just using start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] because if a new category is added in the future (e.g. www.example.com/music.php), the spider wouldn't be able to detect that new category automatically and would have to be edited manually. Not a big deal, but the client doesn't want this.
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"]

    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_category'),
        Rule(LinkExtractor(), follow=False, callback="parse_item"),
    )

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just other stuff
        if is_category:
            category['name'] = name
            category['url'] = response.url
        return category

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
The CrawlSpider rules don't work the way you want here: when more than one rule matches the same link, only the first matching rule is applied, so your second rule (the one with parse_item) never fires; check the documentation. You'll need to implement the logic yourself.
You could try something like:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # no rules

    def parse(self, response):  # this acts as parse_category
        category_le = LinkExtractor("something for categories")
        for a in category_le.extract_links(response):
            yield Request(a.url, callback=self.parse_category)
        item_le = LinkExtractor("something for items")
        for a in item_le.extract_links(response):
            yield Request(a.url, callback=self.parse_item)

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just other stuff
        if is_category:
            category['name'] = name
            category['url'] = response.url
        yield category
        for req in self.parse(response):
            yield req

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
Instead of using parse_category, I used restrict_css in the LinkExtractor to get the links I want, and it seems to feed the second Rule with the extracted links, so my question is answered. It ended up this way:
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        Rule(LinkExtractor(restrict_css=("#movies", "#books"))),
        Rule(LinkExtractor(), callback="parse_item"),
    )

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
It still can't detect newly added categories (and there is no clear pattern to use in restrict_css without also fetching other garbage), but at least it complies with the client's requirements: two rules, one for extracting the category links and another for extracting the items' data.