Scrapy following link not getting data - python

I am trying to follow a list of links and scrape data from each link with a simple Scrapy spider, but I am having trouble.
When I recreate the script in the Scrapy shell, it sends the GET request for the new URL, but when I run the crawl I do not get any data back from the link. The only data I get back is from the starting URL that was scraped before following the link.
How do I scrape data from the link?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "players"
    start_urls = ['http://wiki.teamliquid.net/counterstrike/Portal:Teams']

    def parse(self, response):
        teams = response.xpath('//*[@id="mw-content-text"]/table[1]')
        for team in teams.css('span.team-template-text'):
            yield {
                'teamName': team.css('a::text').extract_first()
            }

        urls = teams.css('span.team-template-text a::attr(href)')
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_team_info)

    def parse_team_info(self, response):
        yield {
            'Test': response.css('span::text').extract_first()
        }

Instead of using
url = response.urljoin(url)
yield scrapy.Request(url, callback=self.parse_team_info)
use
yield response.follow(url, callback=self.parse_team_info)
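This works because response.follow accepts relative URLs and even attribute Selectors directly, whereas response.urljoin expects a string (the css('a::attr(href)') loop above yields Selector objects, not strings). A sketch of the revised loop, using only the selectors already in the question:

def parse(self, response):
    teams = response.xpath('//*[@id="mw-content-text"]/table[1]')
    for team in teams.css('span.team-template-text'):
        yield {'teamName': team.css('a::text').extract_first()}
    # response.follow resolves relative hrefs and accepts Selector objects directly
    for url in teams.css('span.team-template-text a::attr(href)'):
        yield response.follow(url, callback=self.parse_team_info)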

Related

How to use crawled output of first scrapy spider for next scrapy spider

I am new to Scrapy and I want to do the following:
- I want to crawl a homepage and extract some specific listings
- with these listings I want to adjust the URL and crawl the new web page
Crawling the first URL:

class Spider1:
    start_urls = ['https://page1.org/']

    def parse(self, response):
        listings = response.css('get-listings-here').extract()

Crawling the second URL:

class Spider2:
    start_urls = ['https://page1.org/listings[output_of_Spider1]']

    def parse(self, response):
        final_data = response.css('get-needed_data').extract()
        items['final'] = final_data
        yield items
Maybe it is also possible within one spider, I am not sure. What would be the best solution for this?
Thank you!
After extracting the links with your selector, you need to yield a Request for each one, with a callback that will receive the HTML response:

def parse(self, response):
    yield Request('http://amazon.com/', callback=self.page)

def page(self, response):
    # your new page's html response
    pass

Replace the amazon link here with your extracted link.
For reference, see the Scrapy documentation on Request.
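On the asker's follow-up: yes, this is usually done in a single spider — the first callback extracts the listing links and yields follow-up requests, and a second callback parses each listing. A sketch using the question's placeholder selectors ('get-listings-here' and 'get-needed_data' are stand-ins, not real selectors):

import scrapy

class ListingsSpider(scrapy.Spider):
    name = 'listings'
    start_urls = ['https://page1.org/']

    def parse(self, response):
        # extract the listing links from the homepage
        for href in response.css('get-listings-here ::attr(href)').extract():
            # follow each listing and parse it in a second callback
            yield response.follow(href, callback=self.parse_listing)

    def parse_listing(self, response):
        # extract the data you need from each listing page
        yield {'final': response.css('get-needed_data').extract()}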

Any idea why Scrapy Response.follow is not following the links?

I have created a spider, as you can see below. I can get links from the homepage, but when I want to use them in a callback, Scrapy doesn't follow the links. I don't get any HTTP or server errors from the source.
class GamerSpider(scrapy.Spider):
    name = 'gamer'
    allowed_domains = ['eurogamer.net']
    start_urls = ['http://www.eurogamer.net/archive/ps4']

    def parse(self, response):
        for link in response.xpath("//h2"):
            link = link.xpath(".//a/@href").get()
            content = response.xpath("//div[@class='details']/p/text()").get()
            yield response.follow(url=link, callback=self.parse_game, meta={'url': link, 'content': content})

        next_page = 'http://www.eurogamer.net' + response.xpath("//div[@class='buttons forward']/a[@class='button next']/@href").get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_game(self, response):
        url = response.request.meta['url']
        # some things to get
        rows = response.xpath("//main")
        for row in rows:
            # some things to get
            yield {
                'url': url
                # some things to get
            }
Any help?
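Two things stand out in the loop above (observations from the code, not a confirmed diagnosis): content is taken with an absolute response.xpath(...) inside the loop, so every request carries the first match on the page rather than the paragraph belonging to link; and response.follow raises an exception when link is None (an h2 with no anchor), which aborts the rest of the callback. A defensive sketch of the same loop, where the relative path for content is an assumption about the page's markup:

def parse(self, response):
    for h2 in response.xpath("//h2"):
        link = h2.xpath(".//a/@href").get()
        if not link:
            # skip headings without an anchor; response.follow
            # errors out on a None url
            continue
        # scope the summary to this article block, not the whole page
        # (the exact relative path is an assumption about the markup)
        content = h2.xpath("following-sibling::div[@class='details']/p/text()").get()
        yield response.follow(link, callback=self.parse_game,
                              meta={'url': link, 'content': content})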

How to scrape 2 web pages with the same domain using Scrapy in Python?

I am very new to scraping data; I have tried only the basics. My problem is that I have two web pages on the same domain that I need to scrape.
My logic is:
First page: www.sample.com/view-all.html
*This page lists all the items, and I need to get the href attribute of every item.
Second page: www.sample.com/productpage.52689.html
*This is the link that comes from the first page, so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data, such as title and description, from the second page.
I was thinking of a for loop, but it's not working on my end. I searched on Google, but no one has the same problem as mine. Please help me.
import scrapy

class SalesItemSpider(scrapy.Spider):
    name = 'sales_item'
    allowed_domains = ['www.sample.com']
    start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            item = {
                'URL': product_item.css('a::attr(href)').extract_first(),
            }
            yield item
Inside parse you can yield a Request() with the url and a callback name to scrape that url in a different function:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    # here you parse www.sample.com/productpage.52689.html
    item = {
        'title': ...,
        'description': ...
    }
    yield item
Look up Request in the Scrapy documentation and its tutorial.
There is also

response.follow(url, callback=self.parse_subpage)

which resolves relative URLs against the current page automatically, so you don't have to prepend the domain yourself as in

Request(url="www.sample.com/" + url, callback=self.parse_subpage)

See "A shortcut for creating Requests".
If you are interested in scraping, you should read docs.scrapy.org from the first page to the last.
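Putting the two callbacks together with response.follow — a sketch using the question's selectors, where the title/description selectors and the http:// scheme on start_urls are assumptions:

import scrapy

class SalesItemSpider(scrapy.Spider):
    name = 'sales_item'
    allowed_domains = ['www.sample.com']
    start_urls = ['http://www.sample.com/view-all.html']  # start_urls need a scheme

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            href = product_item.css('a::attr(href)').extract_first()
            if href:
                # follow() joins relative hrefs with the current page URL
                yield response.follow(href, callback=self.parse_subpage)

    def parse_subpage(self, response):
        # fill in the real selectors for the product page here
        yield {
            'URL': response.url,
            'title': ...,
            'description': ...,
        }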

Scrapy extract table from website

I am a Python novice and am trying to write a script to extract the data from this page. Using Scrapy, I wrote the following code:
import scrapy

class dairySpider(scrapy.Spider):
    name = "dairy_price"

    def start_requests(self):
        urls = [
            'http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath("//tr"):
            yield {
                # extract() returns a list, so strip each cell individually
                'text': [t.strip('. \n') for t in row.xpath(".//td/text()").extract()],
            }
However, this didn't scrape anything. Do you have any ideas?
Thanks
The table on the page http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i is added to the DOM dynamically by a request to http://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=DAH15&mode=i&domain=blimling&display_ice=&enabled_ice_exchanges=&tz=0&ed=0.
You should be scraping the second link instead of the first: scrapy.Request only returns the HTML source, not content added with JavaScript.
UPDATE
Here is working code for extracting the table data:

import scrapy

class dairySpider(scrapy.Spider):
    name = "dairy_price"

    def start_requests(self):
        urls = [
            "http://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=DAH15&mode=i&domain=blimling&display_ice=&enabled_ice_exchanges=&tz=0&ed=0",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.css(".bcQuoteTable tbody tr"):
            print(row.xpath("td//text()").extract())

Make sure you edit your settings.py file and change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False.
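If you want the rows exported with -o rather than printed to the console, a minimal variation of the same parse method (the 'cells' key is an arbitrary name of mine, not from the page):

def parse(self, response):
    for row in response.css(".bcQuoteTable tbody tr"):
        cells = [c.strip() for c in row.xpath("td//text()").extract()]
        if cells:
            # yielding a dict lets `scrapy crawl dairy_price -o rows.json` export it
            yield {'cells': cells}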

Only 25 entries are stored in JSON files while scraping data using Scrapy; how to increase?

I am scraping data with Scrapy into an items.json file. The data is stored, but the problem is that only 25 entries are saved, while the website has many more. I am using the following spider:
from scrapy import Spider
from scrapy.selector import Selector
# DmozItem comes from your project's items module

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["justdial.com"]
    start_urls = ["http://www.justdial.com/Delhi-NCR/Taxi-Services/ct-57371"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//section[@class="rslwrp"]/section')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
            items.append(item)
        return items
The command I'm using to run the spider is:

scrapy crawl myspider -o items.json -t json

Is there some setting I'm not aware of? Or is the page not fully loaded before scraping? How do I resolve this?
Abhi, here is some code, but please note that it isn't complete and working; it is just to show you the idea. Usually you have to find the next-page URL and try to recreate the appropriate request in your spider. In your case, AJAX is used. I used Firebug to check which requests are sent by the site.
URL = "http://www.justdial.com/function/ajxsearch.php?national_search=0&...page=%s"  # this isn't the complete next-page URL
next_page = 2  # how you handle the next_page counter is up to you

def parse(self, response):
    hxs = Selector(response)
    sites = hxs.xpath('//section[@class="rslwrp"]/section')
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
        yield item

    # build your pagination URL and send a request
    url = self.URL % self.next_page
    yield Request(url)  # Request is the Scrapy request object here
    # increment the next_page counter if required, make additional
    # checks and actions, etc.
Hope this will help.
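To round the idea off with a stop condition, here is a sketch (mine, not the original answerer's) that carries the page counter on the request via meta instead of mutating spider state; URL is still the incomplete placeholder defined above:

def parse(self, response):
    page = response.meta.get('page', 1)
    sites = response.xpath('//section[@class="rslwrp"]/section')
    if not sites:
        return  # an empty result page means there are no more entries
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
        yield item
    # request the next AJAX page; the counter travels in meta
    # (a Request without an explicit callback goes back to parse)
    yield Request(self.URL % (page + 1), meta={'page': page + 1})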
