Unable to scrape some URLs from a webpage - python

I am trying to scrape all the restaurant URLs on a page. There are only 5 restaurant URLs to scrape in this particular example.
At this stage, I am just trying to print them to see if my code works. However, I am not even able to get that done: my code is unable to find any of the URLs.
import scrapy
from hungryhouse.items import HungryhouseItem

class HungryhouseSpider(scrapy.Spider):
    name = "hungryhouse"
    allowed_domains = ["hungryhouse.co.uk"]
    start_urls = ["https://hungryhouse.co.uk/takeaways/westhill-ab32",
                  ]

    def parse(self, response):
        for href in response.xpath('//div[@class="restsRestInfo"]/a/@href'):
            url = response.urljoin(href.extract())
            print(url)
Any guidance as to why the five URLs are not being found would be gratefully received.

Related

How to use crawled output of first scrapy spider for next scrapy spider

I am new to scrapy and I want to do the following:
- I want to crawl a homepage and extract some specific listings
- with these listings I want to adjust the url and crawl the new web page
Crawling First URL
class Spider1:
    start_urls = ['https://page1.org/']

    def parse(self, response):
        listings = response.css('get-listings-here').extract()
Crawling Second URL
class Spider2:
    start_urls = ['https://page1.org/listings[output_of_Spider1]']

    def parse(self, response):
        final_data = response.css('get-needed_data').extract()
        items['final'] = final_data
        yield items
Maybe it is also possible within one spider, I am not sure. But what would be the best solution for it?
Thank you!
After extracting all links from your selector, you need to yield a Request to those links and add a callback where you will receive the HTML response:
def parse(self, response):
    yield Request('http://amazon.com/', callback=self.page)

def page(self, response):
    # your new page html response
You can replace the amazon link above with your extracted links.
Reference: the documentation for Scrapy's Request.

Scrapy crawler pagination for indeed website

# -*- coding: utf-8 -*-
import scrapy

class SearchSpider(scrapy.Spider):
    name = 'search'
    allowed_domains = ['www.indeed.com/']
    start_urls = ['https://www.indeed.com/jobs?q=data%20analyst&l=united%20states']

    def parse(self, response):
        listings = response.xpath('//*[@data-tn-component="organicJob"]')
        for listing in listings:
            title = listing.xpath('.//a[@data-tn-element="jobTitle"]/@title').extract_first()
            link = listing.xpath('.//h2[@class="title"]//a/@href').extract_first()
            company = listing.xpath('normalize-space(.//span[@class="company"]//a/text())').extract_first()
            yield {'title': title,
                   'link': link,
                   'company': company}
        next_page = response.xpath('//ul[@class="pagination-list"]//a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I am trying to extract the job title and company for every job posting across all the Indeed pages. However, I am stuck: the forward button on the Indeed page does not have a fixed link my scraper could follow; the next-page URL looks the same as that of the numbered buttons, and the numbers at the end keep changing, so I cannot reach the next page. I am trying to refrain from using Selenium or Splash, since I want to get my results through Scrapy or BeautifulSoup alone. Any help would be greatly appreciated.
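One common workaround is to generate the next-page URL yourself instead of following the forward button. The sketch below assumes Indeed addresses result pages with a `start` offset parameter, 10 results per page (an assumption about the site, worth verifying in the browser's address bar before relying on it):

```python
from urllib.parse import urlencode

BASE = 'https://www.indeed.com/jobs'

def page_url(query, location, start):
    # start=0 is page 1, start=10 is page 2, and so on (assumed layout)
    return BASE + '?' + urlencode({'q': query, 'l': location, 'start': start})

# the first three result pages for the search in the question
urls = [page_url('data analyst', 'united states', s) for s in range(0, 30, 10)]
```

Inside the spider, each of these URLs would then be yielded as a `scrapy.Request` with `callback=self.parse`, replacing the `next_page` lookup.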

How to get links inside links from webpage in python?

How can I go to a link and get its sub-links, and then again get the sub-links of those? For example,
I want to go to
"https://stackoverflow.com"
then extract its links, e.g.
['https://stackoverflow.com/questions/ask', 'https://stackoverflow.com/?tab=bounties']
and then visit each of those sub-links and extract their links in turn.
I would recommend using Scrapy for this. With Scrapy, you create a spider object which then is run by the Scrapy module.
First, to get all the links on a page, you can create a Selector object and find all of the hyperlink objects using the XPath:
hxs = scrapy.Selector(response)
urls = hxs.xpath('*//a/@href').extract()
Since hxs.xpath returns an iterable list of paths, you can iterate over them directly without storing them in a variable. Also, each URL found should be passed back into this function via the callback argument, allowing it to recursively find all the links within each URL found:
hxs = scrapy.Selector(response)
for url in hxs.xpath('*//a/@href').extract():
    yield scrapy.http.Request(url=url, callback=self.parse)
Each path found might not contain the original URL, so that check has to be made:
if not (url.startswith('http://') or url.startswith('https://')):
    url = "https://stackoverflow.com/" + url
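A more robust alternative to the manual prefix check is the standard library's urljoin, which handles relative paths, absolute paths, and already-absolute URLs uniformly:

```python
from urllib.parse import urljoin

base = "https://stackoverflow.com/"

a = urljoin(base, "questions/ask")           # relative path
b = urljoin(base, "/questions/ask")          # absolute path on the same host
c = urljoin(base, "https://example.com/x")   # already absolute: kept as-is
```

Inside a spider callback, `response.urljoin(url)` does the same thing using the response's own URL as the base.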
Finally, each URL can be passed to a different function to be parsed; in this case it's just printed:
self.handle(url)
All of this put together in a full Spider object looks like this:
import scrapy

class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # limit the scope to stackoverflow
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        # extract all links from the page
        for url in hxs.xpath('*//a/@href').extract():
            # make it a valid url
            if not (url.startswith('http://') or url.startswith('https://')):
                url = "https://stackoverflow.com/" + url
            # process the url
            self.handle(url)
            # recursively parse each url
            yield scrapy.http.Request(url=url, callback=self.parse)

    def handle(self, url):
        print(url)
And the spider would be run like this:
$ scrapy runspider spider.py > urls.txt
Also, keep in mind that running this code will get you rate limited by Stack Overflow. You might want to find a different target for testing, ideally a site that you're hosting yourself.
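If you do run it against a live site, throttling the spider is a courteous precaution. A minimal settings fragment (the values here are illustrative, not prescriptive):

```python
# settings.py fragment (illustrative values)
ROBOTSTXT_OBEY = True        # respect the site's robots.txt
DOWNLOAD_DELAY = 1.0         # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True  # back off automatically when the server slows down
```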

python-scrapy project to return list of urls, and scrape content inside of urls

Currently trying to scrape this page ('https://sportschatplace.com/nba-picks') for a project, with a Scrapy spider that collects each game URL and then goes into each game's page to get more information.
When I run it, it returns no scraped pages. Any help would be appreciated. Here's a snippet of my code:
class GameSpider(scrapy.Spider):
    name = 'games'
    allowed_domains = ['sportschatplace.com']
    start_urls = [
        'https://sportschatplace.com/nba-picks'
    ]

    def parse(self, response):
        games = response.css("div.home-a").extract_first()
        for g in games:
            url = urljoin(response.url, g)
            yield scrapy.Request(url, callback=self.parse_game)

    def parse_game(self, response):
        for info in response.css('div.gutter'):
            yield {
                'game_teams': info.css('p.heading-sub').extract_first(),  # check if these are correct before running
                'game_datetime': info.css('h2.heading-sub').extract_first(),
                'game_line': info.css('h3.heading-sub').extract_first(),
                # 'game_text': info.css(' ').extract(),
                'game_pick': info.css('h3.block mt1 dark-gray').extract(),
            }
games = response.css("div.home-a").extract_first()
div.home-a contains multiple divs; you are extracting only the first one, and extract_first() also converts that div into a string.
What I got from the link is that your css is not giving you what you want.
Try this:
css = '[itemprop="url"]::attr(href)'
games = response.css(css).extract()  # list of game urls
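The underlying Python pitfall in the original loop: extract_first() returns a single string, and iterating over a string yields individual characters, so each `g` was one character rather than a URL. A minimal illustration:

```python
# what extract_first() effectively returns: a single HTML string
html = '<div class="home-a">...</div>'
chars = [g for g in html]   # iterating a string yields characters
print(chars[:5])

# what extract() returns: a list of strings, which iterates as intended
urls = ['/game1', '/game2']
for u in urls:
    print(u)
```

So each character was joined onto the base URL, producing garbage requests that Scrapy's offsite filtering silently dropped.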

Scrapy: Spider optimization

I'm trying to scrape an e-commerce web site, and I'm doing it in 2 steps.
This website has a structure like this:
The homepage has the links to the family-items and subfamily-items pages
Each family & subfamily page has a paginated list of products
Right now I have 2 spiders:
GeneralSpider to get the homepage links and store them
ItemSpider to get elements from each page
I'm completely new to Scrapy, and I'm following some tutorials to achieve this. I'm wondering how complex the parse functions can be and how rules work. My spiders right now look like:
GeneralSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    start_urls = ['http://www.domain.org/home']

    def parse(self, response):
        links = LinksItem()
        links['content'] = response.xpath("//div[@id='h45F23']").extract()
        return links
ItemSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    # Each URL in the file has pagination if it has more than 30 elements
    # I don't know how to paginate over each URL
    f.close()

    def parse(self, response):
        item = ShopItem()
        item['name'] = response.xpath("//h1[@id='u_name']").extract()
        item['description'] = response.xpath("//h3[@id='desc_item']").extract()
        item['prize'] = response.xpath("//div[@id='price_eur']").extract()
        return item
Which is the best way to make the spider follow the pagination of a URL?
If the pagination is jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each focused on one thing?
I've also googled for any book related to Scrapy, but it seems there isn't any finished book yet, or at least I couldn't find one.
Does anyone know if a Scrapy book will be released soon?
Edit:
These 2 URLs fit this example. On the Eroski home page you can get the URLs to the product pages.
On a products page you have a paginated list of items (Eroski Items):
URL to get links: Eroski Home
URL to get items: Eroski Fruits
On the Eroski Fruits page, the pagination of the items seems to be jQuery/AJAX, because more items are shown as you scroll down. Is there a way to get all these items with Scrapy?
Which is the best way to make the spider follow the pagination of a URL?
This is very site-specific and depends on how the pagination is implemented.
If the pagination is jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
This is exactly your use case - the pagination is made via additional AJAX calls that you can simulate inside your Scrapy spider.
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each focused on one thing?
Yes, the "rules" mechanism that a CrawlSpider provides is a very powerful piece of technology. It is highly configurable: you can have multiple rules, some of which would follow specific links that match specific criteria or are located in a specific section of a page. Having a single spider with multiple rules should be preferred over having multiple spiders.
Speaking about your specific use-case, here is the idea:
make a rule to follow categories and subcategories in the navigation menu of the home page - this is where restrict_xpaths would help
in the callback, for every category or subcategory, yield a Request that mimics the AJAX request sent by your browser when you open a category page
in the AJAX response handler (callback), parse the available items and yield another Request for the same category/subcategory but with the page GET parameter incremented (getting the next page)
Example working implementation:
import re
import urllib

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ProductItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()

class GrupoeroskiSpider(CrawlSpider):
    name = 'grupoeroski'
    allowed_domains = ['compraonline.grupoeroski.com']
    start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//div[@class="navmenu"]'), callback='parse_categories')
    ]

    def parse_categories(self, response):
        pattern = re.compile(r'/(\d+)\-\w+')
        groups = pattern.findall(response.url)
        params = {'page': 1, 'categoria': groups.pop(0)}
        if groups:
            params['grupo'] = groups.pop(0)
        if groups:
            params['familia'] = groups.pop(0)

        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})

    def parse_products(self, response):
        for product in response.xpath('//div[@class="product_element"]'):
            item = ProductItem()
            item['description'] = product.xpath('.//span[@class="description_1"]/text()').extract()[0]
            item['price'] = product.xpath('.//div[@class="precio_line"]/p/text()').extract()[0]
            yield item

        params = response.meta['params']
        params['page'] += 1
        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})
Hope this is a good starting point for you.
Does anyone know if some Scrapy book that will be released soon?
Nothing specific that I can recall.
Though I heard that some publisher has plans to maybe release a book about web-scraping, but I'm not supposed to tell you that.
