How to store the URLs crawled with Scrapy? - python

I have a web crawler that crawls for news stories on a web page.
I know how to use the XPathSelector to scrape certain information from the elements on the page.
However, I cannot seem to figure out how to store the URL of the page that was just crawled.
class spidey(CrawlSpider):
    name = 'spidey'
    start_urls = ['http://nytimes.com']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'page/\d+' : regular expression for http://nytimes.com/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_articles'),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://nytimes.com/YYYY/MM/title URLs
    ]
I want to store every link that passes those rules.
What would I need to add to parse_articles to store the link in my item?
def parse_articles(self, response):
    item = SpideyItem()
    item['link'] = ???
    return item

response.url is what you are looking for.
See the docs on the Response object and check this simple example.
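For reference, a minimal sketch of the callback with response.url filled in (assuming SpideyItem defines a link field, as in the question):

def parse_articles(self, response):
    item = SpideyItem()
    item['link'] = response.url  # full URL of the page this callback was invoked for
    return item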

Related

how to scrape javascript web site?

Hello everyone, I'm a beginner at scraping and I'm trying to scrape all iPhones on https://www.electroplanet.ma/.
This is the script I wrote:
import re

import scrapy
from ..items import EpItem

class ep(scrapy.Spider):
    name = "ep"
    start_urls = ["https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=1",
                  "https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=2"
                  ]

    def parse(self, response):
        products = response.css("ol li")  # find all items on the page
        for product in products:
            try:
                lien = product.css("a.product-item-link::attr(href)").get()   # get the link of each item
                image = product.css("a.product-item-photo::attr(href)").get()  # get the image
                # to get into each item page and scrape it, I use the follow method;
                # I pass image as an argument to parse_item because I couldn't scrape the image
                # from the item's page (I think it's hidden)
                yield response.follow(lien, callback=self.parse_item, cb_kwargs={"image": image})
            except:
                pass

    def parse_item(self, response, image):
        item = EpItem()
        item["Nom"] = response.css(".ref::text").get()
        pattern = re.compile(r"\s*(\S+(?:\s+\S+)*)\s*")
        item["Catégorie"] = pattern.search(response.xpath("//h1/a/text()").get()).group(1)
        item["Marque"] = pattern.search(response.xpath("//*[@data-th='Marque']/text()").get()).group(1)
        try:
            item["RAM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE RAM']/text()").get()).group(1)
        except:
            pass
        item["ROM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE DE STOCKAGE']/text()").get()).group(1)
        item["Couleur"] = pattern.search(response.xpath("//*[@data-th='COULEUR']/text()").get()).group(1)
        item["lien"] = response.request.url
        item["image"] = image
        item["état"] = "neuf"
        item["Market"] = "Electro Planet"
        yield item
I have problems scraping all the pages, because the site uses JavaScript for pagination, so I wrote all the page links in start_urls. I believe that's not the best practice, so I'd appreciate some advice on how to improve my code.
You can use the scrapy-playwright plugin to scrape interactive websites. As for start_urls, just add the main index URL if there is only one website, and check this link in the Scrapy docs to make the spider follow the page links automatically instead of writing them out manually.
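A rough sketch of how that could look, with scrapy-playwright rendering the JavaScript and the spider following the pagination link instead of hard-coding ?p=1, ?p=2 (the a.action.next selector for the next-page link is a guess and would need to be checked against the actual page):

# settings.py - standard scrapy-playwright setup
# DOWNLOAD_HANDLERS = {
#     "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
#     "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
# }
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

import scrapy

class EpPlaywrightSpider(scrapy.Spider):
    name = "ep_playwright"
    start_urls = ["https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone"]

    def start_requests(self):
        for url in self.start_urls:
            # ask Playwright to render the page so the JavaScript-driven listing is present
            yield scrapy.Request(url, meta={"playwright": True})

    def parse(self, response):
        # follow every product link on the listing page
        for lien in response.css("a.product-item-link::attr(href)").getall():
            yield response.follow(lien, callback=self.parse_item)
        # follow the "next page" link if there is one (selector is an assumption)
        next_page = response.css("a.action.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, meta={"playwright": True}, callback=self.parse)

    def parse_item(self, response):
        # same field extraction as parse_item in the question
        ...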

How to use crawled output of first scrapy spider for next scrapy spider

I am new to scrapy and I want to do the following:
- I want to crawl a homepage and extract some specific listings
- with these listings I want to adjust the url and crawl the new web page
Crawling First URL
class Spider1:
    start_urls = 'https://page1.org/'

    def parse(self, response):
        listings = response.css('get-listings-here').extract()
Crawling Second URL
class Spider2:
    start_urls = 'https://page1.org/listings[output_of_Spider1]'

    def parse(self, response):
        final_data = response.css('get-needed_data').extract()
        items['final'] = final_data
        yield items
Maybe it is also possible within one spider, I am not sure. But what would be the best solution for it?
Thank you!
After extracting the links with your selector, you need to yield a Request for each of them and add a callback where you will receive the HTML response:
def parse(self, response):
    yield Request('http://amazon.com/', callback=self.page)

def page(self, response):
    # your new page html response
You can replace this amazon link with your extracted link.
Reference: the Scrapy documentation for Request.
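If you'd rather keep it in one spider, a minimal sketch could look like this (the CSS selectors and the 'final' field are placeholders based on the question, not real selectors):

import scrapy

class ListingsSpider(scrapy.Spider):
    name = "listings"
    start_urls = ["https://page1.org/"]

    def parse(self, response):
        # extract the listing links from the homepage (placeholder selector)
        for href in response.css("a.listing::attr(href)").getall():
            # follow each listing and parse it in a second callback
            yield response.follow(href, callback=self.parse_listing)

    def parse_listing(self, response):
        # scrape the data you need from the listing page (placeholder selector)
        yield {
            "final": response.css("div.needed-data::text").get(),
            "url": response.url,
        }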

How to crawl and scrape data at the same time?

It's my first experience with web scraping and I'm not sure if I'm doing it well or not. The thing is, I want to crawl and scrape data at the same time:
Get all the links that I'm gonna scrape
Store them into MongoDB
Visit them one by one to scrape their content
# Crawling: get all links to be scraped later on
class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start for start in xrange(0, 10000, 20)]

    def parse(self, response):
        # loop for all pages
        next_page = Selector(response).xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract()
        if not not next_page:
            yield Request("https://" + next_page[0], callback=self.parse)
        # loop for all links in a single page
        links = Selector(response).xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')
        for link in links:
            item = Link()
            url = response.urljoin(link.xpath('a/@href')[0].extract())
            item['url'] = url
            items.append(item)
        for item in items:
            yield item

# Scraping: get all the stored links from MongoDB and scrape them????
What exactly is your use case? Are you primarily interested in the links or in the content of the pages they lead to? I.e. is there any reason to first store the links in MongoDB and scrape the pages later? If you really need to store the links in MongoDB, it's best to use an item pipeline to store the items. In the link, there's even an example of storing items in MongoDB. If you need something more sophisticated, look at the scrapy-mongodb package.
Other than that, there are some comments to the actual code you posted:
Instead of Selector(response).xpath(...) use just response.xpath(...).
If you need only the first extracted element from selector, use extract_first() instead of using extract() and indexing.
Don't use if not not next_page:, use if next_page:.
The second loop over items is not needed, yield item in the loop over links.
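Putting those comments together, the parse method could look roughly like this (a sketch that keeps the original XPaths and the Link item from the question):

def parse(self, response):
    # follow the next page, if any
    next_page = response.xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract_first()
    if next_page:
        yield Request("https://" + next_page, callback=self.parse)
    # yield one item per link, no intermediate list needed
    links = response.xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')
    for link in links:
        item = Link()
        item['url'] = response.urljoin(link.xpath('a/@href').extract_first())
        yield item

And for storing in MongoDB, a pipeline along the lines of the example in the Scrapy docs (the database and collection names here are made up):

import pymongo

class MongoPipeline(object):
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["scraping"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # store every yielded item as a document in the "links" collection
        self.db["links"].insert_one(dict(item))
        return item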

Scrapy: Spider optimization

I'm trying to scrape an e-commerce website, and I'm doing it in 2 steps.
This website has a structure like this:
The homepage has the links to the family-items and subfamily-items pages
Each family & subfamily page has a list of products paginated
Right now I have 2 spiders:
GeneralSpider to get the homepage links and store them
ItemSpider to get elements from each page
I'm completely new to Scrapy, and I'm following some tutorials to achieve this. I'm wondering how complex the parse functions can be and how rules work. My spiders right now look like:
GeneralSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    start_urls = ['http://www.domain.org/home']

    def parse(self, response):
        links = LinksItem()
        links['content'] = response.xpath("//div[@id='h45F23']").extract()
        return links
ItemSpider:
class ItemSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    # Each URL in the file has pagination if it has more than 30 elements
    # I don't know how to paginate over each URL
    f.close()

    def parse(self, response):
        item = ShopItem()
        item['name'] = response.xpath("//h1[@id='u_name']").extract()
        item['description'] = response.xpath("//h3[@id='desc_item']").extract()
        item['prize'] = response.xpath("//div[@id='price_eur']").extract()
        return item
Which is the best way to make the spider follow the pagination of a URL?
If the pagination is jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each focused on one thing?
I've also googled looking for any book related to Scrapy, but it seems there isn't a finished book yet, or at least I couldn't find one.
Does anyone know if a Scrapy book will be released soon?
Edit:
These 2 URLs fit this example. On the Eroski Home page you can get the URLs to the product pages.
On the products page you have a paginated list of items (Eroski Items):
URL to get Links: Eroski Home
URL to get Items: Eroski Fruits
On the Eroski Fruits page, the pagination of the items seems to be jQuery/AJAX, because more items are shown when you scroll down. Is there a way to get all these items with Scrapy?
Which is the best way to make the spider follow the pagination of a URL?
This is very site-specific and depends on how the pagination is implemented.
If the pagination is jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
This is exactly your use case - the pagination is made via additional AJAX calls that you can simulate inside your Scrapy spider.
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each focused on one thing?
Yes, the "rules" mechanism that a CrawlSpider provides is a very powerful piece of technology - it is highly configurable - you can have multiple rules, some of which would follow specific links that match specific criteria or that are located in a specific section of a page. Having a single spider with multiple rules should be preferred over having multiple spiders.
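For illustration, a minimal sketch of a spider with two rules - one that only follows category links and one that sends product pages to a callback. The URL patterns are made-up placeholders, and it uses the current import paths rather than the older scrapy.contrib ones in the example further below:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MultiRuleSpider(CrawlSpider):
    name = 'multirule'
    allowed_domains = ['domain.org']
    start_urls = ['http://www.domain.org/home']

    rules = [
        # follow category and subcategory pages without parsing them
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        # send product pages to a dedicated callback
        Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product'),
    ]

    def parse_product(self, response):
        yield {'name': response.xpath('//h1/text()').get(), 'url': response.url}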
Speaking about your specific use-case, here is the idea:
make a rule to follow categories and subcategories in the navigation menu of the home page - this is where restrict_xpaths would help
in the callback, for every category or subcategory yield a Request that would mimic the AJAX request sent by your browser when you open a category page
in the AJAX response handler (callback), parse the available items and yield another Request for the same category/subcategory but with the page GET parameter increased (getting the next page)
Example working implementation:
import re
import urllib

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ProductItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()

class GrupoeroskiSpider(CrawlSpider):
    name = 'grupoeroski'
    allowed_domains = ['compraonline.grupoeroski.com']
    start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//div[@class="navmenu"]'), callback='parse_categories')
    ]

    def parse_categories(self, response):
        pattern = re.compile(r'/(\d+)\-\w+')
        groups = pattern.findall(response.url)
        params = {'page': 1, 'categoria': groups.pop(0)}
        if groups:
            params['grupo'] = groups.pop(0)
        if groups:
            params['familia'] = groups.pop(0)
        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})

    def parse_products(self, response):
        for product in response.xpath('//div[@class="product_element"]'):
            item = ProductItem()
            item['description'] = product.xpath('.//span[@class="description_1"]/text()').extract()[0]
            item['price'] = product.xpath('.//div[@class="precio_line"]/p/text()').extract()[0]
            yield item
        params = response.meta['params']
        params['page'] += 1
        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})
Hope this is a good starting point for you.
Does anyone know if some Scrapy book that will be released soon?
Nothing specific that I can recall.
Though I heard that some publisher has plans to maybe release a book about web scraping, but I'm not supposed to tell you that.

Scrapy - no list page, but I know the url for each item page

I'm using Scrapy to scrape a website. The item pages that I want to scrape look like http://www.somepage.com/itempage/&page=x, where x is any number from 1 to 100. Thus, I have an SgmlLinkExtractor Rule with a callback function specified for any page resembling this.
The website does not have a list page with all the items, so I want to somehow tell Scrapy to scrape those URLs (from 1 to 100). This guy here seemed to have the same issue, but couldn't figure it out.
Does anyone have a solution?
You could list all the known URLs in your Spider class' start_urls attribute:
class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    start_urls = ['http://www.somepage.com/itempage/&page=%s' % page for page in xrange(1, 101)]

    def parse(self, response):
        # ...
If it's just a one-time thing, you can create a local HTML file file:///c:/somefile.html with all the links. Start scraping that file and add somepage.com to the allowed domains.
Alternately, in the parse function, you can return a new Request which is the next URL to be scraped.
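A rough sketch of that last approach, chaining the page number from inside the callback (assuming the same &page= URL scheme from the question; the item extraction itself is left out):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    start_urls = ['http://www.somepage.com/itempage/&page=1']

    def parse(self, response):
        # ... scrape the item data from this page ...
        # then queue the next page until page 100 is reached
        current = int(response.url.rsplit('=', 1)[1])
        if current < 100:
            next_url = 'http://www.somepage.com/itempage/&page=%s' % (current + 1)
            yield Request(next_url, callback=self.parse)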
