I have built a very simple scraper using Scrapy. For the output table, I would like to show the Google News search term as well as the Google resultstats value.
The information I would like to capture is showing in the source of the Google page as
<input class="gsfi" value="Elon Musk">
and
<div id="resultStats">About 52,300 results</div>
I have already tried to include both through ('input.value::text') and ('id.resultstats::text'), which did not work, however. Does anyone have an idea how to solve this situation?
import scrapy
class QuotesSpider(scrapy.Spider):
name = 'quotes'
allowed_domains = ['google.com']
start_urls = ['https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2015%2Ccd_max%3A12%2F31%2F2015&tbm=nws']
def parse(self, response):
for quote in response.css('div.quote'):
item = {
'search_title': quote.css('input.value::text').extract(),
'results': quote.css('id.resultstats::text').extract(),
}
yield item
The pages renders differently when you access it with Scrapy.
The search field becomes:
response.css('input#sbhost::attr(value)').get()
The results count is:
response.css('#resultStats::text').get()
Also, there is no quote class on that page.
You can test this in the scrapy shell:
scrapy shell -s ROBOTSTXT_OBEY=False "https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2015%2Ccd_max%3A12%2F31%2F2015&tbm=nws"
And then run those 2 commands.
[EDIT]
If your goal is to get one item for each URL, then you can do this:
def parse(self, response):
item = {
'search_title': response.css('input#sbhost::attr(value)').get(),
'results': response.css('#resultStats::text').get(),
}
yield item
If your goal is to extract every result on the page, then you need something different.
Related
I'm starting around with Scrapy, and managed to extract some of the data I need. However, not everything is properly obtained. I'm applying the knowledge from the official tutorial found here, but it's not working. I've Googled around a bit, and also read this SO question but I'm fairly certain this isn't the problem here.
Anyhow, I'm trying to parse the product information from this webshop. I'm trying to obtain the product name, price, rrp, release date, category, universe, author and publisher. Here is the relevant CSS for one product: https://pastebin.com/9tqnjs7A. Here's my code. Everything with a #! at the end isn't working as expected.
import scrapy
import pprint
class ForbiddenPlanetSpider(scrapy.Spider):
name = "fp"
start_urls = [
'https://forbiddenplanet.com/catalog/?q=mortal%20realms&sort=release-date&page=1',
]
def parse(self, response):
for item in response.css("section.zshd-00"):
print(response.css)
name = item.css("h3.h4::text").get() #!
price = item.css("span.clr-price::text").get() + item.css("span.t-small::text").get()
rrp = item.css("del.mqr::text").get()
release = item.css("dd.mzl").get() #!
category = item.css("li.inline-list__item::text").get() #!
universe = item.css("dt.txt").get() #!
authors = item.css("a.SubTitleItems").get() #!
publisher = item.css("dd.mzl").get() #!
pprint.pprint(dict(name=name,
price=price,
rrp=rrp,
release=release,
category=category,
universe=universe,
authors=authors,
publisher = publisher
)
)
I think I need to add some sub-searching (at the moment release and publisher have the same criteria, for example), but I don't know how to word it to search for it (I've tried, but ended up with generic tutorials that don't cover it). Anything pointing me in the right direction is appreciated!
Oh, and I didn't include ' ' spaces because whenever I used one Scrapy immediately failed to find.
Scrapy doesn't render JS, try to disable javascript in your browser and refresh the page, the HTML structure is different for site version without JS.
you should rewrite your selectors with a new HTML structure. Try to use XPATH instead of CSS it's much flexible.
UPD
The easiest way to scrape this website makes a request to https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&sort=release-date
The response is a JSON object with all necessary data. You may transform the "results" field (or the whole JSON object) to a python dictionary and get all fields with dictionary methods.
A code draft that works and shows the idea.
import scrapy
import json
def get_tags(tags: list):
parsed_tags = []
if tags:
for tag in tags:
parsed_tags.append(tag.get('name'))
return parsed_tags
return None
class ForbiddenplanetSpider(scrapy.Spider):
name = 'forbiddenplanet'
allowed_domains = ['forbiddenplanet.com']
start_urls = ['https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&sort=release-date']
def parse(self, response):
response_dict = json.loads(response.body)
items = response_dict.get('results')
for item in items:
yield {
'name': item.get('title'),
'price': item.get('site_price'),
'rrp': item.get('rrp'),
'release': item.get('release_date'),
'category': get_tags(item.get('derived_tags').get('type')),
'universe': get_tags(item.get('derived_tags').get('universe')),
'authors': get_tags(item.get('derived_tags').get('author')),
'publisher': get_tags(item.get('derived_tags').get('publisher')),
}
next_page = response_dict.get('next')
if next_page:
yield scrapy.Request(
url=next_page,
callback=self.parse
)
I'm very new to python and am trying to build a script that will eventually extract page titles and s from specified URLs to a .csv in the format I specify.
I have tried managed to get the spider to work in CMD using :
response.xpath("/html/head/title/text()").get()
So the xpath must be right.
Unfortunately when I run the the file my spider is in it never seems to work properly. I think the issue is in the final block of code, unfortunately all the guides I follow seem to use CSS. I feel more comfortable with xpath because you can simply copy,paste it from Dev Tools.
import scrapy
class PageSpider(scrapy.Spider):
name = "dorothy"
start_urls = [
"http://www.example.com",
"http://www.example.com/blog"]
def parse(self, response):
for title in response.xpath("/html/head/title/text()"):
yield {
"title": sel.xpath("Title a::text").extract_first()
}
I expected when that to give me the page title of the above URLs.
First of all, your second url on self.start_urls is invalid and returning 404, so you will end up with only one title extracted.
Second, you need to read more about selectors, you extracted the title on your test on shell but got confused when using it on your spider.
Scrapy will call the parse method for each url on self.start_urls, so you don't need to iterate trough titles, you only have one per page.
You also can access the <title> tag directly using // at the beginning of your xpath expression, see this text copied from W3Schools :
/ Selects from the root node
// Selects nodes in the document from the current node that match the selection no matter where they are
Here is the fixed code:
import scrapy
class PageSpider(scrapy.Spider):
name = "dorothy"
start_urls = [
"http://www.example.com"
]
def parse(self, response):
yield {
"title": response.xpath('//title/text()').extract_first()
}
Hi guys I am very new in scraping data, I have tried the basic one. But my problem is I have 2 web page with same domain that I need to scrape
My Logic is,
First page www.sample.com/view-all.html
*This page open all the list of items and I need to get all the href attr of every item.
Second page www.sample.com/productpage.52689.html
*this is the link came from the first page so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data like title, description etc on the second page.
what I am thinking is for loop but Its not working on my end. I am searching on google but no one has the same problem as mine. please help me
import scrapy
class SalesItemSpider(scrapy.Spider):
name = 'sales_item'
allowed_domains = ['www.sample.com']
start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']
def parse(self, response):
for product_item in response.css('li.product-item'):
item = {
'URL': product_item.css('a::attr(href)').extract_first(),
}
yield item`
Inside parse you can yield Request() with url and function's name to scrape this url in different function
def parse(self, response):
for product_item in response.css('li.product-item'):
url = product_item.css('a::attr(href)').extract_first()
# it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
yield scrapy.Request(url=url, callback=self.parse_subpage)
def parse_subpage(self, response):
# here you parse from www.sample.com/productpage.52689.html
item = {
'title': ...,
'description': ...
}
yield item
Look for Request in Scrapy documentation and its tutorial
There is also
response.follow(url, callback=self.parse_subpage)
which will automatically add www.sample.com to urls so you don't have to do it on your own in
Request(url = "www.sample.com/" + url, callback=self.parse_subpage)
See A shortcut for creating Requests
If you interested in scraping then you should read docs.scrapy.org from first page to the last one.
I am trying out scrapy now. I tried the example code in http://doc.scrapy.org/en/1.0/intro/overview.html page. I tried extracting the recent questions with tag 'bigdata'. Everything worked well. But when I tried to extract questions with both tags 'bigdata' and 'python', the results were not correct, with questions having only 'bigdata' tag coming in the result. But on browser I am getting questions with both the tags correctly. Please find the code below:
import scrapy
class StackOverflowSpider(scrapy.Spider):
name = 'stackoverflow'
start_urls = ['https://stackoverflow.com/questions/tagged/bigdata?page=1&sort=newest&pagesize=50']
def parse(self, response):
for href in response.css('.question-summary h3 a::attr(href)'):
full_url = response.urljoin(href.extract())
yield scrapy.Request(full_url, callback=self.parse_question)
def parse_question(self, response):
yield {
'title': response.css('h1 a::text').extract()[0],
'votes': response.css('.question .vote-count-post::text').extract()[0],
'body': response.css('.question .post-text').extract()[0],
'tags': response.css('.question .post-tag::text').extract(),
'link': response.url,
}
When I change start_urls as
start_urls = ['https://stackoverflow.com/questions/tagged/bigdata+python?page=1&sort=newest&pagesize=50']
the results contain questions with only 'bigdata' tag. How to get questions with both the tags only?
Edit: I think what is happening is that scrapy is going into pages with tag 'bigdata' from the main page I gave because the tags are links to the main page for that tag. How can I edit this code to make scrapy not go into the tag pages and only questions in that page? I tried using rules like below but results were still not right.
rules = (Rule(LinkExtractor(restrict_css='.question-summary h3 a::attr(href)'), callback='parse_question'),)
The url you have (as well as the initial css rules) is correct; or more simply:
start_urls = ['https://stackoverflow.com/questions/tagged/python+bigdata']
Extrapolating from this, this will also work:
start_urls = ['https://stackoverflow.com/questions/tagged/bigdata%20python']
The issue you are running into however, is that stackoverflow appears to require you to be logged in to access the multiple tag search feature. To see this, simply log out of your stackoverflow session and try the same url in your browser. It will redirect you to a page of results for the first of the two tags only.
TL;DR the only way to get the multiple tags feature appears to be logging in (enforced via session cookies)
Thus, when using scrapy, the fix is to authenticate the session (login) before doing anything else, and then proceed to parse as normal and it all works. To do this, you can use an InitSpider instead of Spider and add the appropriate login methods. Assuming you login with StackOverflow directly (as opposed to through Google or the like), I was able to get it working as expected like this:
import scrapy
import getpass
from scrapy.spiders.init import InitSpider
class StackOverflowSpider(InitSpider):
name = 'stackoverflow'
login_page = 'https://stackoverflow.com/users/login'
start_urls = ['https://stackoverflow.com/questions/tagged/bigdata+python']
def parse(self, response):
...
def parse_question(self, response):
...
def init_request(self):
return scrapy.Request(url=self.login_page, callback=self.login)
def login(self, response):
return scrapy.FormRequest.from_response(response,
formdata={'email': 'yourEmailHere#foobar.com',
'password': getpass.getpass()},
callback=self.check_login_response)
def check_login_response(self, response):
if "/users/logout" in response.body:
self.log("Successfully logged in")
return self.initialized()
else:
self.log("Failed login")
I'm trying to scrap an e-commerce web site, and I'm doing it in 2 steps.
This website has a structure like this:
The homepage has the links to the family-items and subfamily-items pages
Each family & subfamily page has a list of products paginated
Right now I have 2 spiders:
GeneralSpider to get the homepage links and store them
ItemSpider to get elements from each page
I'm completely new to Scrapy, I'm following some tutorials to achieve this. I'm wondering how complex can be the parse functions and how rules works. My spiders right now looks like:
GeneralSpider:
class GeneralSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['domain.org']
start_urls = ['http://www.domain.org/home']
def parse(self, response):
links = LinksItem()
links['content'] = response.xpath("//div[#id='h45F23']").extract()
return links
ItemSpider:
class GeneralSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['domain.org']
f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
# Each URL in the file has pagination if it has more than 30 elements
# I don't know how to paginate over each URL
f.close()
def parse(self, response):
item = ShopItem()
item['name'] = response.xpath("//h1[#id='u_name']").extract()
item['description'] = response.xpath("//h3[#id='desc_item']").extract()
item['prize'] = response.xpath("//div[#id='price_eur']").extract()
return item
Wich is the best way to make the spider follow the pagination of an url ?
If the pagination is JQuery, meaning there is no GET variable in the URL, Would be possible to follow the pagination ?
Can I have different "rules" in the same spider to scrap different parts of the page ? or is better to have the spiders specialized, each spider focused in one thing?
I've also googled looking for any book related with Scrapy, but it seems there isn't any finished book yet, or at least I couldn't find one.
Does anyone know if some Scrapy book that will be released soon ?
Edit:
This 2 URL's fits for this example. In the Eroski Home page you can get the URL's to the products page.
In the products page you have a list of items paginated (Eroski Items):
URL to get Links: Eroski Home
URL to get Items: Eroski Fruits
In the Eroski Fruits page, the pagination of the items seems to be JQuery/AJAX, because more items are shown when you scroll down, is there a way to get all this items with Scrapy ?
Which is the best way to make the spider follow the pagination of an url ?
This is very site-specific and depends on how the pagination is implemented.
If the pagination is JQuery, meaning there is no GET variable in the URL, Would be possible to follow the pagination ?
This is exactly your use case - the pagination is made via additional AJAX calls that you can simulate inside your Scrapy spider.
Can I have different "rules" in the same spider to scrape different parts of the page ? or is better to have the spiders specialized, each spider focused in one thing?
Yes, the "rules" mechanism that a CrawlSpider provides is a very powerful piece of technology - it is highly configurable - you can have multiple rules, some of them would follow specific links that match specific criteria, or located in a specific section of a page. Having a single spider with multiple rules should be preferred comparing to having multiple spiders.
Speaking about your specific use-case, here is the idea:
make a rule to follow categories and subcategories in the navigation menu of the home page - this is there restrict_xpaths would help
in the callback, for every category or subcategory yield a Request that would mimic the AJAX request sent by your browser when you open a category page
in the AJAX response handler (callback) parse the available items and yield an another Request for the same category/subcategory but increasing the page GET parameter (getting next page)
Example working implementation:
import re
import urllib
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class ProductItem(scrapy.Item):
description = scrapy.Field()
price = scrapy.Field()
class GrupoeroskiSpider(CrawlSpider):
name = 'grupoeroski'
allowed_domains = ['compraonline.grupoeroski.com']
start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']
rules = [
Rule(LinkExtractor(restrict_xpaths='//div[#class="navmenu"]'), callback='parse_categories')
]
def parse_categories(self, response):
pattern = re.compile(r'/(\d+)\-\w+')
groups = pattern.findall(response.url)
params = {'page': 1, 'categoria': groups.pop(0)}
if groups:
params['grupo'] = groups.pop(0)
if groups:
params['familia'] = groups.pop(0)
url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
yield scrapy.Request(url,
meta={'params': params},
callback=self.parse_products,
headers={'X-Requested-With': 'XMLHttpRequest'})
def parse_products(self, response):
for product in response.xpath('//div[#class="product_element"]'):
item = ProductItem()
item['description'] = product.xpath('.//span[#class="description_1"]/text()').extract()[0]
item['price'] = product.xpath('.//div[#class="precio_line"]/p/text()').extract()[0]
yield item
params = response.meta['params']
params['page'] += 1
url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
yield scrapy.Request(url,
meta={'params': params},
callback=self.parse_products,
headers={'X-Requested-With': 'XMLHttpRequest'})
Hope this is a good starting point for you.
Does anyone know if some Scrapy book that will be released soon?
Nothing specific that I can recall.
Though I heard that some publisher has some plans to may be release a book about web-scraping, but I'm not supposed to tell you that.