I am a Python novice and am trying to write a script to extract the data from this page. Using scrapy, I wrote the following code:
import scrapy

class dairySpider(scrapy.Spider):
    name = "dairy_price"

    def start_requests(self):
        urls = [
            'http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for rows in response.xpath("//tr"):
            yield {
                'text': rows.xpath(".//td/text()").extract().strip('. \n'),
            }
However, this didn't scrape anything. Do you have any ideas?
Thanks
The table on the page http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i is added to the DOM dynamically by a request to http://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=DAH15&mode=i&domain=blimling&display_ice=&enabled_ice_exchanges=&tz=0&ed=0.
You should be scraping the second link instead of the first, because scrapy.Request only returns the HTML source code, not the content added by JavaScript.
UPDATE
Here is the working code for extracting table data
import scrapy

class dairySpider(scrapy.Spider):
    name = "dairy_price"

    def start_requests(self):
        urls = [
            "http://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=DAH15&mode=i&domain=blimling&display_ice=&enabled_ice_exchanges=&tz=0&ed=0",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.css(".bcQuoteTable tbody tr"):
            print(row.xpath("td//text()").extract())
Make sure you edit your settings.py file and change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False
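If you would rather not edit the project-wide settings.py, Scrapy spiders can also override settings through the custom_settings class attribute; a minimal sketch, reusing the spider above:

class dairySpider(scrapy.Spider):
    name = "dairy_price"
    # Per-spider override, so robots.txt is ignored only for this spider
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }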
Related
This problem is starting to frustrate me very much, as I feel like I have no clue how Scrapy works and I can't wrap my head around the documentation.
My question is simple. I have the most standard of spiders.
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        test = scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        site_number_of_pages = int(response.xpath(..))
        return site_number_of_pages
I just want to somehow get the number of pages from the parse function back into the start_requests function so I can start a for loop to go through all the pages on the website, reusing the same parse function. The code above illustrates the principle only and would not work in practice: the variable test would be a Request object, not the plain integer that I want.
How would I accomplish what I am trying to do?
EDIT:
This is what I have tried so far:
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = ..
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        site_number_of_pages = int(response.xpath(..))
        for count in range(2, site_number_of_pages):
            url = url + str(count)
            yield scrapy.Request(url=url, callback=self.parse, headers=header)
Scrapy is an asynchronous framework. There is no way to "return" to start_requests - there are only Requests followed by their callbacks.
In general, when a Request can only be built from the result of parsing some response (in your case, site_number_of_pages from the first url), it does not belong in start_requests.
The easiest thing you can do in this case is to yield the follow-up requests from the parse method:
def parse(self, response):
    site_number_of_pages = int(response.xpath(..))
    for i in range(site_number_of_pages):
        ...
        yield Request(url=...)
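To make the pattern concrete, here is a minimal sketch; the example.com urls, the XPath for the page count, and the parse_page callback are placeholders rather than anything from the real site:

import scrapy

class PagesSpider(scrapy.Spider):
    name = "pages"

    def start_requests(self):
        # Only the first page is known up front
        yield scrapy.Request(url="https://example.com/search?page=1", callback=self.parse)

    def parse(self, response):
        # Hypothetical XPath; point it at whatever element holds the page count
        site_number_of_pages = int(response.xpath("//span[@class='page-count']/text()").get())
        for count in range(2, site_number_of_pages + 1):
            yield scrapy.Request(
                url="https://example.com/search?page=" + str(count),
                callback=self.parse_page,
            )

    def parse_page(self, response):
        # Extract items from each result page here
        ...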
Instead of grabbing the number of pages and looping through all of them, I grabbed the "next page" feature of the web page. So every time self.parse is activated, it grabs the next page and calls itself again. This goes on until there is no next page, at which point it simply errors out.
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        ..
        next_page = response.xpath(..)
        url = "www.website.whatever/search/filter1=.../&page=" + next_page
        yield scrapy.Request(url=url, callback=self.parse, headers=header)
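If you prefer the crawl to end cleanly instead of erroring out on the last page, you can guard the yield. A small sketch, assuming the "next page" control exposes a relative href (the selector is illustrative):

def parse(self, response):
    # ... extract items here ...
    next_page = response.xpath("//a[@class='next']/@href").get()
    if next_page:  # None on the last page, so the crawl stops here
        yield response.follow(next_page, callback=self.parse)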
I have created a spider, as you can see below. I can get the links from the homepage, but when I want to use them in the next function, Scrapy doesn't follow the links. I don't get any HTTP or server errors from the source.
class GamerSpider(scrapy.Spider):
    name = 'gamer'
    allowed_domains = ['eurogamer.net']
    start_urls = ['http://www.eurogamer.net/archive/ps4']

    def parse(self, response):
        for link in response.xpath("//h2"):
            link = link.xpath(".//a/@href").get()
            content = response.xpath("//div[@class='details']/p/text()").get()
            yield response.follow(url=link, callback=self.parse_game, meta={'url': link, 'content': content})
        next_page = 'http://www.eurogamer.net' + response.xpath("//div[@class='buttons forward']/a[@class='button next']/@href").get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_game(self, response):
        url = response.request.meta['url']
        # some things to get
        rows = response.xpath("//main")
        for row in rows:
            # some things to get
            yield {
                'url': url,
                # some things to get
            }
Any help?
Hi guys, I am very new to scraping data and have only tried the basics. My problem is that I have two web pages on the same domain that I need to scrape.
My logic is:
First page: www.sample.com/view-all.html
*This page lists all the items, and I need to get the href attribute of every item.
Second page: www.sample.com/productpage.52689.html
*This is the link that comes from the first page, so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data, like title, description, etc., from the second page.
What I am thinking of is a for loop, but it's not working on my end. I have searched on Google, but no one has the same problem as mine. Please help me.
import scrapy

class SalesItemSpider(scrapy.Spider):
    name = 'sales_item'
    allowed_domains = ['www.sample.com']
    start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            item = {
                'URL': product_item.css('a::attr(href)').extract_first(),
            }
            yield item
Inside parse you can yield a Request() with the url and the name of a callback function, to scrape that url in a different function:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    # here you parse www.sample.com/productpage.52689.html
    item = {
        'title': ...,
        'description': ...
    }
    yield item
Look for Request in the Scrapy documentation and its tutorial.
There is also
response.follow(url, callback=self.parse_subpage)
which will automatically join a relative url with the response's base url (www.sample.com), so you don't have to do it on your own as in
Request(url="www.sample.com/" + url, callback=self.parse_subpage)
See "A shortcut for creating Requests" in the docs.
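To make the difference concrete, here is a small sketch; the li.product-item selector is taken from the question, everything else is illustrative:

def parse(self, response):
    for product_item in response.css('li.product-item'):
        href = product_item.css('a::attr(href)').extract_first()
        # With scrapy.Request you would have to build an absolute url yourself:
        #   yield scrapy.Request(url=response.urljoin(href), callback=self.parse_subpage)
        # response.follow accepts the relative href directly:
        yield response.follow(href, callback=self.parse_subpage)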
If you are interested in scraping, you should read docs.scrapy.org from the first page to the last.
I am trying to follow a list of links and scrape data from each link with a simple Scrapy spider, but I am having trouble.
In the Scrapy shell, when I recreate the script, it sends the GET request for the new url, but when I run the crawl I do not get any data back from the link. The only data I get back is from the starting url that was scraped before following the link.
How do I scrape data from the link?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "players"
    start_urls = ['http://wiki.teamliquid.net/counterstrike/Portal:Teams']

    def parse(self, response):
        teams = response.xpath('//*[@id="mw-content-text"]/table[1]')
        for team in teams.css('span.team-template-text'):
            yield {
                'teamName': team.css('a::text').extract_first()
            }

        urls = teams.css('span.team-template-text a::attr(href)')
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_team_info)

    def parse_team_info(self, response):
        yield {
            'Test': response.css('span::text').extract_first()
        }
Instead of using
url = response.urljoin(url)
yield scrapy.Request(url, callback=self.parse_team_info)
use
yield response.follow(url, callback=self.parse_team_info)
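The original version fails because the loop iterates over Selector objects, while response.urljoin and Request expect a string url; response.follow accepts relative urls and attribute selectors directly, so it works as-is. If you would rather keep scrapy.Request, extracting the href strings first also works - a minimal sketch:

urls = teams.css('span.team-template-text a::attr(href)').extract()
for url in urls:
    yield scrapy.Request(response.urljoin(url), callback=self.parse_team_info)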
I have inspected the element in Chrome:
I want to get the data in the red box (there can be more than one) using Scrapy. I used this code (following the tutorial from the Scrapy documentation):
import scrapy

class KamusSetSpider(scrapy.Spider):
    name = "kamusset_spider"
    start_urls = ['http://kbbi.web.id/' + 'abad']

    def parse(self, response):
        for kamusset in response.css("div#d1"):
            text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
            print(dict(text=text))
But there is no result.
What happened? I have changed it to the following (using Splash), but it is still not working:
def start_requests(self):
    # requires: from scrapy_splash import SplashRequest
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

def parse(self, response):
    html = response.body
    for kamusset in response.css("div#d1"):
        text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
        print(dict(text=text))
In this case it seems that the page content is generated dynamically: even though you can see the elements when inspecting the page in the browser, they are not present in the HTML source (i.e. in what Scrapy sees). That's because Scrapy can't render JavaScript. You need some kind of browser to render the page and then pass the result to Scrapy for processing. I recommend Splash for its seamless integration with Scrapy.
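Note that SplashRequest only takes effect once scrapy-splash is wired into the project. Roughly, and assuming a Splash instance is already running on localhost:8050 (for example via the scrapinghub/splash Docker image), the settings.py additions look like this:

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

The spider itself then imports SplashRequest from scrapy_splash, as in the snippet above.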