How to feed a spider with links crawled within the spider? - python

I'm writing a spider (CrawlSpider) for an online store. According to the client's requirements, I need to write two rules: one for determining which pages have items and another for extracting the items.
I have both rules already working independently:
If my start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] and I comment out the first Rule and the code of parse_category, my parse_item will extract every item.
On the other hand, if start_urls = "http://www.example.com" and I comment out the second Rule and the code of parse_item, parse_category will return every link that contains items to extract, i.e. parse_category will return www.example.com/books.php and www.example.com/movies.php.
My problem is that I don't know how to merge both modules, so that start_urls = "http://www.example.com" and then parse_category extracts www.example.com/books.php and www.example.com/movies.php and feeds those links to parse_item, where I actually extract the info of each item.
I need to do it this way instead of just using start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] because if a new category is added in the future (e.g. www.example.com/music.php), the spider wouldn't detect it automatically and would have to be edited manually. Not a big deal, but the client doesn't want this.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    #start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"]
    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_category'),
        Rule(LinkExtractor(), follow=False, callback="parse_item"),
    )

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just another stuff
        if is_category:  # placeholder flag set by the code above
            category['name'] = name  # placeholder value extracted by the code above
            category['url'] = response.url
            return category

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item

The CrawlSpider rules don't work the way you want, so you'll need to implement the logic yourself. Each extracted link is handed to the first rule that matches it, so with two identical LinkExtractor() rules the second one never receives anything. Rules with follow=True are meant to keep collecting links (not items) while the crawl proceeds; check the documentation.
You could try something like:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # no rules

    def parse(self, response):  # this is parse_category
        category_le = LinkExtractor("something for categories")
        for a in category_le.extract_links(response):
            yield Request(a.url, callback=self.parse_category)
        item_le = LinkExtractor("something for items")
        for a in item_le.extract_links(response):
            yield Request(a.url, callback=self.parse_item)

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just another stuff
        if is_category:  # placeholder flag set by the code above
            category['name'] = name  # placeholder value extracted by the code above
            category['url'] = response.url
            yield category
        # keep following category and item links found on this page
        for req in self.parse(response):
            yield req

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item

Instead of using parse_category, I used restrict_css in LinkExtractor to get the links I want, and it seems to be feeding the second Rule with the extracted links, so my question is answered. It ended up this way:
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = (
        Rule(LinkExtractor(restrict_css=("#movies", "#books"))),
        Rule(LinkExtractor(), callback="parse_item"),
    )

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
It still can't detect newly added categories (and there is no clear pattern to use in restrict_css without fetching other garbage), but at least it complies with the client's requirements: two rules, one for extracting category links and another for extracting item data.

Related

Scrapy yield only last data and merge scrapy data into one

I am scraping a news website with the Scrapy framework, but it seems to store only the last item scraped, repeated for every iteration of the loop.
I want to store the title, date, and link, which I scrape from the first page, and also store the whole news article. So I want to merge the article, which is stored in a list, into a single string.
Item code
import scrapy


class ScrapedItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    source = scrapy.Field()
    date = scrapy.Field()
    paragraph = scrapy.Field()
Spider code
import scrapy
from ..items import ScrapedItem


class CBNCSpider(scrapy.Spider):
    name = 'kontan'
    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        box_text = response.xpath("//ul/li/div[@class='ket']")
        items = ScrapedItem()
        for crawl in box_text:
            title = crawl.css("h1 a::text").extract()
            source = "https://investasi.kontan.co.id" + (crawl.css("h1 a::attr(href)").extract()[0])
            date = crawl.css("span.font-gray::text").extract()[0].replace("|", "")
            items['title'] = title
            items['source'] = source
            items['date'] = date
            yield scrapy.Request(url=source,
                                 callback=self.parseparagraph,
                                 meta={'item': items})

    def parseparagraph(self, response):
        items_old = response.meta['item']  # only last item stored
        paragraph = response.xpath("//p/text()").extract()
        items_old['paragraph'] = paragraph  # merge into single string
        yield items_old
I expect the date, title, and source in the output to be updated on each pass through the loop, and the article to be merged into a single string so it can be stored in MySQL.
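I create a new dictionary inside the loop for each article and put those values in it, so every request carries its own item instead of sharing one object that is overwritten on each iteration. I've also made some minor changes to your XPaths and CSS selectors to make them less error-prone. The script is working as desired now: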
import scrapy


class CBNCSpider(scrapy.Spider):
    name = 'kontan'
    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        for crawl in response.xpath("//*[@id='list-news']//*[@class='ket']"):
            d = {}  # a fresh dict per article
            d['title'] = crawl.css("h1 > a::text").get()
            d['source'] = response.urljoin(crawl.css("h1 > a::attr(href)").get())
            d['date'] = crawl.css("span.font-gray::text").get().strip("|")
            yield scrapy.Request(
                url=d['source'],
                callback=self.parseparagraph,
                meta={'item': d}
            )

    def parseparagraph(self, response):
        items_old = response.meta['item']
        items_old['paragraph'] = response.xpath("//p/text()").getall()
        yield items_old
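The answer above still stores paragraph as a list of text nodes, while the question asked for a single string. A minimal sketch of that merge step, assuming the list comes from getall() as in parseparagraph; the helper name merge_paragraphs and the sample strings are illustrative and not part of the original spider:

```python
def merge_paragraphs(paragraphs):
    """Join scraped text nodes into one string, dropping empty fragments."""
    cleaned = (p.strip() for p in paragraphs)
    return " ".join(p for p in cleaned if p)


# Hypothetical sample standing in for response.xpath("//p/text()").getall()
sample = ["  First paragraph. ", "\n", "Second paragraph."]
merged = merge_paragraphs(sample)
```

In parseparagraph this would presumably be used as items_old['paragraph'] = merge_paragraphs(response.xpath("//p/text()").getall()).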

IMDB scrapy get all movie data

I am working on a class project and trying to get all IMDB movie data (titles, budgets, etc.) up until 2016. I adapted the code from https://github.com/alexwhb/IMDB-spider/blob/master/tutorial/spiders/spider.py.
My thought is: for i in range(1874,2016) (since 1874 is the earliest year shown on http://www.imdb.com/year/), direct the program to the corresponding year's page, and grab the data from that URL.
But the problem is that each page for each year only shows 50 movies, so after crawling those 50 movies, how can I move on to the next page? And after crawling each year, how can I move on to the next year? This is my code for the URL parsing part so far, but it is only able to crawl 50 movies for a particular year.
class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = ["http://www.imdb.com/search/title?year=2014,2014&title_type=feature&sort=moviemeter,asc"]

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
You can use a CrawlSpider to simplify your task. As you'll see below, start_requests dynamically generates the list of URLs while parse_page only extracts the movies to crawl. Finding and following the 'Next' link is done by the rules attribute.
I agree with @Padraic Cunningham that hard-coding values is not a great idea. I've added spider arguments so that you can call:
scrapy crawl imdb -a start=1950 -a end=1980 (the scraper will default to 1874-2016 if it doesn't get any arguments).
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from imdbyear.items import MovieItem


class IMDBSpider(CrawlSpider):
    name = 'imdb'
    rules = (
        # extract links at the bottom of the page. note that there are 'Prev' and 'Next'
        # links, so a bit of additional filtering is needed
        Rule(LinkExtractor(restrict_xpaths=('//*[@id="right"]/span/a')),
             # return a list (not a lazy filter object) so the links can be iterated more than once
             process_links=lambda links: [l for l in links if 'Next' in l.text],
             callback='parse_page',
             follow=True),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2016

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year + 1):
            yield scrapy.Request('http://www.imdb.com/search/title?year=%d,%d&title_type=feature&sort=moviemeter,asc' % (year, year))

    def parse_page(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            # note -- you had 'MianPageUrl' as your scrapy field name. I would recommend fixing this typo
            # (you will need to change it in items.py as well)
            item['MainPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MainPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request

    # make sure that the dynamically generated start_urls are parsed as well
    parse_start_url = parse_page

    # do your magic
    def parseMovieDetails(self, response):
        pass
You can use the piece of code below, placed at the end of parse, to follow the next page:
# 'a.lister-page-next.next-page::attr(href)' is the selector to get the next page link
next_page = response.css('a.lister-page-next.next-page::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)  # joins current and next page url
    yield scrapy.Request(next_page, callback=self.parse)  # calls parse function again when crawled to next page
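As far as I understand it, response.urljoin resolves a relative href against the URL of the page being parsed, much like the standard library's urljoin. A small sketch of that resolution step using only the standard library; the relative link below is a made-up example rather than one taken from IMDB:

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (from the question's start_urls)
page_url = "http://www.imdb.com/search/title?year=2014,2014&title_type=feature&sort=moviemeter,asc"

# Hypothetical relative href as it might appear in the 'Next' link
next_href = "/search/title?year=2014,2014&start=51"

# An absolute-path reference keeps the scheme and host and replaces path and query
next_url = urljoin(page_url, next_href)
```

This is why the snippet above can pass the extracted href straight to scrapy.Request after joining it, without hard-coding the host.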
I figured out a very dumb way to solve this: I put all the links in start_urls. A better solution would be very much appreciated!
class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = []
    for i in range(1874, 2017):
        for j in range(1, 11501, 50):
            # since the largest number of movies for a year to have is 11,400 (2016)
            start_url = "http://www.imdb.com/search/title?sort=moviemeter,asc&start=" + str(j) + "&title_type=feature&year=" + str(i) + "," + str(i)
            start_urls.append(start_url)

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
The code that @Greg Sadetsky has provided needs one minor change, in the first line of the parse_page method.
Just change the XPath in the for loop from:
response.xpath("//*[@class='results']/tr/td[3]"):
to
response.xpath("//*[contains(@class,'lister-item-content')]/h3"):
This worked like a charm for me!

Scrapy CrawlSpider rules with multiple callbacks

I'm trying to create an ExampleSpider which implements Scrapy's CrawlSpider. My ExampleSpider should be able to process pages containing only artist info, pages containing only album info, and some other pages which contain both album and artist info.
I was able to handle the first two scenarios, but the problem occurs in the third. I'm using the parse_artist(response) method to process artist data and the parse_album(response) method to process album data.
My question is: if a page contains both artist and album data, how should I define my rules?
Should I do it like below? (Two rules for the same URL pattern)
Should I use multiple callbacks? (Does Scrapy support multiple callbacks?)
Is there another way to do it? (A proper way)
class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]

    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            # do the scrape and assign to ArtistItem
            pass
        except Exception:
            # ignore for now
            pass
        return artist_item

    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            # do the scrape and assign to AlbumItem
            pass
        except Exception:
            # ignore for now
            pass
        return album_item
The CrawlSpider calls its _requests_to_follow() method to extract URLs and generate the requests to follow:
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)
As you can see:
The variable seen remembers the URLs that have already been processed.
Every URL will be parsed by at most one callback.
You can define a parse_item() that calls parse_artist() and parse_album():
rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
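To make the "at most one callback" point concrete, here is a simplified model of the seen bookkeeping in plain Python. It is only a sketch of the idea and not Scrapy's actual code; the function name dispatch and the URLs are made up for illustration:

```python
def dispatch(links_per_rule):
    """Assign each link to the first rule that extracts it.

    links_per_rule: list of (callback_name, [urls]) in rule order.
    Returns a dict mapping url -> callback_name.
    """
    seen = set()
    assigned = {}
    for callback, urls in links_per_rule:
        # links already claimed by an earlier rule are skipped
        fresh = [u for u in urls if u not in seen]
        seen.update(fresh)
        for url in fresh:
            assigned[url] = callback
    return assigned


# Two rules with the same pattern extract the same links (hypothetical URLs)
rules = [
    ("parse_artist", ["http://www.example.com/a", "http://www.example.com/b"]),
    ("parse_album", ["http://www.example.com/a", "http://www.example.com/b"]),
]
result = dispatch(rules)
```

In this model every link ends up with parse_artist and none reach parse_album, which is why a single parse_item callback that calls both parsers is the safer design.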

How to get scrapy spider to add information to an item based on a CSV file

As some of you may have gathered, I'm learning scrapy to scrape some data off of Google Scholar for a research project that I am running. I have a file that contains many article titles for which I am scraping citations. I read in the file using pandas, generate the URLs that need scraping, and start scraping.
One problem that I face is 503 errors. Google shuts me off fairly quickly, and many entries remain unscraped. This is a problem that I am working on using some middleware provided by Crawlera.
Another problem I face is that when I export my scraped data, I have a hard time matching the scraped data to what I was trying to look for. My input data is a CSV file with three fields -- 'Authors','Title','pid' where 'pid' is a unique identifier.
I use pandas to read in the file and generate URLs for scholar based off the title. Each time a given URL is scraped, my spider goes through the scholar webpage, and picks up the title, publication information and cites for each article listed on that page.
Here is how I generate the links for scraping:
class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]
    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles (urllib.quote on Python 2; needs `import urllib.parse` on Python 3)
    queries = data.Title.apply(urllib.parse.quote)
    # generate a var to store links
    links = []
    # create the URLs to crawl
    for entry in queries:
        links.append("http://scholar.google.com/scholar?q=allintitle%3A" + entry)
    # give the URLs to scrapy
    start_urls = links
For example, one title from my data file could be the paper 'Elephants Don't Play Chess' by Rodney Brooks with 'pid' 5067. The spider goes to
http://scholar.google.com/scholar?q=allintitle%3Aelephants+don%27t+play+chess
Now on this page, there are six hits. The spider gets all six hits, but they need to be assigned the same 'pid'. I know I need to insert a line somewhere that reads something like item['pid'] = data.pid.apply("something") but I can't figure out exactly how I would do that.
Below is the rest of the code for my spider. I am sure the way to do this is pretty straightforward, but I can't think of how to get the spider to know which entry of data.pid it should look for if that makes sense.
    def parse(self, response):
        # initialize something to hold the data
        items = []
        sel = Selector(response)
        # get each 'entry' on the page
        # an entry is a self contained div
        # that has the title, publication info
        # and cites
        entries = sel.xpath('//div[@class="gs_ri"]')
        # a counter for the entry that is being scraped
        count = 1
        for entry in entries:
            item = ScholarscrapeItem()
            # get the title
            title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
            # the title is messy
            # clean up
            item['title'] = "".join(title)
            # get publication info
            # clean up
            author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
            item['authors'] = "".join(author)
            # get the portion that contains citations
            cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
            # find the part that says "Cited by"
            match = re.search(r"Cited by \d+", str(cite_string))
            # if it exists, note the number
            if match:
                cites = re.search(r"\d+", match.group()).group()
            # if not, there is no citation info
            else:
                cites = None
            item['cites'] = cites
            item['entry'] = count
            # iterate the counter
            count += 1
            # append this item to the list
            items.append(item)
        return items
I hope this question is well-defined, but please let me know if I can be more clear. There is really not much else in my scraper except some lines at the top importing things.
Edit 1: Based on suggestions below, I have modified my code as follows:
# test-case: http://scholar.google.com/scholar?q=intitle%3Amigratory+birds
import re
from urllib.parse import quote

from pandas import read_csv
from scrapy import Request, Spider
from scrapy.selector import Selector
from scholarscrape.items import ScholarscrapeItem


class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]
    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(quote)
    pid = data.pid
    # generate a var to store links
    urls = []
    # create the URLs to crawl
    for entry in queries:
        urls.append("http://scholar.google.com/scholar?q=allintitle%3A" + entry)
    # give the URLs to scrapy, one (url, pid) tuple per title
    start_urls = list(zip(urls, pid))

    def make_requests_from_url(self, url_pid):
        url, pid = url_pid  # each start_urls entry is a (url, pid) tuple
        return Request(url, meta={'pid': pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # initialize something to hold the data
        items = []
        sel = Selector(response)
        # get each 'entry' on the page
        # an entry is a self contained div
        # that has the title, publication info
        # and cites
        entries = sel.xpath('//div[@class="gs_ri"]')
        # a counter for the entry that is being scraped
        count = 1
        for entry in entries:
            item = ScholarscrapeItem()
            # get the title
            title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
            # the title is messy
            # clean up
            item['title'] = "".join(title)
            # get publication info
            # clean up
            author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
            item['authors'] = "".join(author)
            # get the portion that contains citations
            cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
            # find the part that says "Cited by"
            match = re.search(r"Cited by \d+", str(cite_string))
            # if it exists, note the number
            if match:
                cites = re.search(r"\d+", match.group()).group()
            # if not, there is no citation info
            else:
                cites = None
            item['cites'] = cites
            item['entry'] = count
            # every hit on this page gets the pid of the query that produced it
            item['pid'] = response.meta['pid']
            # iterate the counter
            count += 1
            # append this item to the list
            items.append(item)
        return items
You need to populate your start_urls list with (url, pid) tuples.
Then redefine the method make_requests_from_url(url) so that it unpacks each tuple and passes the pid along in the request's meta:
class ScholarSpider(Spider):
    name = "ScholarSpider"
    allowed_domains = ["scholar.google.com"]
    start_urls = (
        ('http://www.scholar.google.com/', 100),
    )

    def make_requests_from_url(self, url_pid):
        url, pid = url_pid  # unpack the (url, pid) tuple
        return Request(url, meta={'pid': pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        pid = response.meta['pid']
        print('!!!!!!!!!!!', pid, '!!!!!!!!!!!!')
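Building the (url, pid) tuples can be sketched with the standard library alone. This is only an illustration under stated assumptions: the inline CSV text stands in for master_jeea.csv, the helper name build_start_pairs is made up, and quote_plus is used so that spaces become + as in the example URL from the question:

```python
import csv
import io
from urllib.parse import quote_plus

# Hypothetical stand-in for master_jeea.csv with the three fields from the question
CSV_TEXT = """Authors,Title,pid
Rodney Brooks,Elephants Don't Play Chess,5067
"""

BASE = "http://scholar.google.com/scholar?q=allintitle%3A"


def build_start_pairs(csv_file):
    """Return a list of (url, pid) tuples, one per row of the CSV."""
    reader = csv.DictReader(csv_file)
    return [(BASE + quote_plus(row["Title"]), row["pid"]) for row in reader]


pairs = build_start_pairs(io.StringIO(CSV_TEXT))
```

Note that csv.DictReader yields the pid as a string, so convert it with int() if the exported data needs a number. If your Scrapy version no longer supports make_requests_from_url (it was deprecated in later releases, as far as I know), the same pairs could be consumed in start_requests instead.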

Scrapy spider get information that is inside of links

I have written a spider that can take the information from this page and follow the "Next page" links. Right now, the spider only takes the information shown in the following structure.
The structure of the page is something like this
Title 1
URL 1 ---------> If you click you go to one page with more information
Location 1
Title 2
URL 2 ---------> If you click you go to one page with more information
Location 2
Next page
What I want is for the spider to visit each URL link and get the full information. I suppose I must write another rule that specifies this.
The behaviour of the spider should be:
Go to URL1 (get info)
Go to URL2 (get info)
...
Next page
But I don't know how to implement it. Can someone guide me?
Code of my spider:
class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        sites = sel.xpath("//div[@id='llista-resultats']/div")
        items = []
        cont = 0
        for site in sites:
            item = BcnItem()
            item['id'] = cont
            item['title'] = u''.join(site.xpath('h3/a/text()').extract())
            item['url'] = u''.join(site.xpath('h3/a/@href').extract())
            item['when'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[1]/text()').extract())
            item['where'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[2]/span/a/text()').extract())
            item['street'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[3]/span/text()').extract())
            item['phone'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[4]/text()').extract())
            items.append(item)
            cont = cont + 1
        return items
EDIT: After searching the internet, I found code that does this.
First I have to get all the links, then call another parse method for each of them.
def parse(self, response):
    # Get all URLs (_url stands for each detail link found on the page)
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
    pass
If you want to use Rules because the page has a paginator, rename parse to parse_start_url and then call this method through the Rule. With these changes the parser begins at parse_start_url, and the code would look something like this:
rules = (
    Rule(
        SgmlLinkExtractor(
            allow=(re.escape("index.php")),
            restrict_xpaths=("//div[@class='paginador']")),
        callback="parse_start_url",
        follow=True),
)

def parse_start_url(self, response):
    # Get all URLs (_url stands for each detail link found on the page)
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
    pass
That's all, folks.
There is an easier way of achieving this. Click "Next page" on your link, and read the new URL carefully:
http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10
Looking at the GET parameters in the URL (everything after the question mark), and with a bit of testing, we find that they mean:
from=10 - Starting index
q=*:* - Search query
nr=10 - Number of items to display
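The parameters listed above can also be read programmatically. A minimal sketch using the standard library's urlsplit and parse_qs on the URL from this answer; the variable names are illustrative:

```python
from urllib.parse import parse_qs, urlsplit

url = "http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10"

# parse_qs maps each parameter name to a list of its values
params = parse_qs(urlsplit(url).query)

start_index = int(params["from"][0])  # starting index
page_size = int(params["nr"][0])      # number of items to display
query = params["q"][0]                # search query
```

Reading the values this way makes it easy to check how from and nr change from one page to the next before hard-coding a step size.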
This is how I would've done it:
Set nr=100 or higher. (1000 may work as well; just make sure there is no timeout.)
Loop from from=0 to 34300 in steps of nr. This is above the current number of entries. You may want to extract this value first.
Example code:
entries = 34246
step = 100
# round entries up to the next multiple of step: 34246 - 46 + 100 = 34300
stop = entries - entries % step + step
for x in range(0, stop, step):
    url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
    # Loop over all entries, and open links if needed
