I looked over a similar question involving the same site, but my problem seems to involve different CSS/HTML. I have read through the Scrapy tutorial, but I am still having trouble getting the code to print anything.
import scrapy


class finvizSpider(scrapy.Spider):
    name = "finviz"
    start_urls = [
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]

    def parse(self, response):
        data = response.xpath('//div[@id="screener-content"]/div/table/tbody').extract()
        print(data)
Any help would be much appreciated. Thanks
EDIT:
The answer below lets me run just one URL in the Scrapy shell, but it redirects the original URL to the main page without the filters applied.
I am still unable to run this in Python. I will attach the output.
There are two issues in your code:
First, the Scrapy response does not contain <tbody> elements, because Scrapy downloads the original HTML page. What you see in the browser is the modified HTML, where the browser inserts <tbody> elements into tables.
Second, you have added an extra div element in //div[@id="screener-content"]/div/table/tbody (the /div between the container and the table).
Try this XPath, '//div[@id="screener-content"]/table//tr/td/table//tr/td//text()', or just run the modified code below.
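If you want to sanity-check a selector before touching the spider, the scrapy shell is handy. A quick sketch, using the first screener URL from the question and the corrected XPath (results may differ from what the browser shows, as the question's edit suggests the site redirects some requests):
scrapy shell "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change"
# then, inside the shell:
response.xpath('//div[@id="screener-content"]/table//tr/td/table//tr/td//text()').extract()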
Code
import scrapy


class finvizSpider(scrapy.Spider):
    name = "finviz"
    start_urls = [
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]

    def parse(self, response):
        tickers = response.xpath('//a[@class="screener-link-primary"]/text()').extract()
        print(tickers)
Output Screenshot
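Regarding the edit about not being able to run this in Python: one option, as a sketch (assuming the finvizSpider class above is defined in, or imported into, the same script), is to drive Scrapy from a plain Python script with CrawlerProcess instead of the scrapy command:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()   # uses default settings; pass a dict to override them
process.crawl(finvizSpider)  # the spider class defined above
process.start()              # blocks until the crawl finishes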
I hope that you're all well. I am trying to learn Python through web scraping. My project at the moment is to scrape data from a games store. Initially, I want to follow each product link and print the response from it. There are 60 game links on the page that I want Scrapy to follow. Below is the code.
import scrapy


class GameSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['365games.co.uk']
    start_urls = ['https://www.365games.co.uk/3ds-games/']

    def parse(self, response):
        all_games = response.xpath('//*[@id="product_grid"]')
        for game in all_games:
            game_url = game.xpath('.//h3/a/@href').extract_first()
            yield scrapy.Request(game_url, callback=self.parse_game)

    def parse_game(self, response):
        print(response.status)
When I run this code, Scrapy goes through the first link, prints the response, and then stops. When I change .extract_first() to .extract() I get the following:
TypeError: Request url must be str or unicode, got list
The same applies to .get()/.getall(): .get() only returns the first result and .getall() produces the above error.
Any help would be greatly appreciated, but please be gentle I am trying to learn.
Thanks in advance and best regards,
Gav
The error is saying that you are passing a list instead of a string to scrapy.Request. This tells us that game_url is actually a list when you want a string. You are very close here, but I believe your problem is that you are looping in the wrong place. Your first XPath returns just a single element (the product grid), rather than a list of items. It is within this element that you want to find your game URLs, leading to:
def parse(self, response):
    product_grid = response.xpath('//*[@id="product_grid"]')
    for game_url in product_grid.xpath('.//h3/a/@href').getall():
        yield scrapy.Request(game_url, callback=self.parse_game)
You could also combine your XPath queries directly into one:
def parse(self, response):
    all_games = response.xpath('//*[@id="product_grid"]//h3/a/@href')
    for game_url in all_games.getall():
        yield scrapy.Request(game_url, callback=self.parse_game)
In this case you could also use response.follow instead of creating a new Request directly. You can even pass a selector rather than a string to follow, so you don't need getall(), and it knows how to deal with <a> elements so you don't need the @href either!
def parse(self, response):
    for game_url in response.xpath('//*[@id="product_grid"]//h3/a'):
        yield response.follow(game_url, callback=self.parse_game)
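From there, parse_game can extract whatever you need from each product page. The selector below is only a placeholder (I haven't checked 365games' product page markup), so adjust it to the real HTML:
def parse_game(self, response):
    # hypothetical selector -- replace with the real title element on the product page
    title = response.xpath('//h1/text()').get()
    yield {
        "url": response.url,
        "status": response.status,
        "title": title,
    }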
I have a Scrapy spider that works well as long as I give it a page that contains the links to the pages it should scrape.
Now I want to give it not all the categories, but a page that contains links to all the categories.
I thought I could simply add another parse function to achieve this,
but the console output gives me an attribute error:
"AttributeError: 'zaubersonder' object has no attribute 'parsedetails'"
This tells me that some attribute reference is not working correctly.
I am new to object orientation, but I thought Scrapy calls parse, which calls parse_level2, which in turn calls parse_details, and that this should work fine.
Below is my effort so far.
import scrapy


class zaubersonder(scrapy.Spider):
    name = 'zaubersonder'
    allowed_domains = ['abc.de']
    start_urls = ['http://www.abc.de/index.php/rgergegregre.html']

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to categories
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_level2)

    def parse_level2(self, response):
        urls2 = response.css('a.ulSubMenu::attr(href)').extract()  # links to entries
        for url2 in urls2:
            url2 = response.urljoin(url2)
            yield scrapy.Request(url=url2, callback=self.parse_details)

    def parse_details(self, response):  # extract entries
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.block').extract() + response.css('div.ce_text.last.block').extract(),
        }
Edit: fixed the code above in case someone searches for it later.
There is a typo in the code. The callback in parse_level2 is self.parsedetails, but the function is named parse_details.
Just change the yield in parse_level2 to:
yield scrapy.Request(url=url2, callback=self.parse_details)
...and it should work.
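In context, the corrected method looks like this (the rest of the spider stays exactly as posted):
def parse_level2(self, response):
    urls2 = response.css('a.ulSubMenu::attr(href)').extract()  # links to entries
    for url2 in urls2:
        url2 = response.urljoin(url2)
        yield scrapy.Request(url=url2, callback=self.parse_details)  # was self.parsedetails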
I have the crawler implemented as below.
It is working, and it goes through the pages allowed by the link extractor.
Basically, what I am trying to do is extract information from different places on the page:
- href and text() under the class 'news' (if it exists)
- the image URL under the class 'think block' (if it exists)
I have three problems with my Scrapy setup:
1) Duplicating LinkExtractor
It seems to process the same page more than once. (I checked the export file and found that the same ~.img appeared many times, which is hardly plausible.)
In fact, every page on the website has hyperlinks at the bottom that let users jump to the topic they are interested in, while my objective is to extract information from each topic's page (which lists the titles of several articles under that topic) and from the images found within each article's page (you reach an article's page by clicking on its title on the topic page).
I suspect the link extractor loops over the same pages again and again in this case.
(Maybe this can be solved with DEPTH_LIMIT?)
2) Improving parse_item
I think parse_item is quite inefficient. How could I improve it? I need to extract information from different places on the page (and of course only extract it if it exists). Besides, it looks like parse_item only yields HKejImage items and never HKejItem items (again, I checked the output file). How should I tackle this?
3) I need the spider to be able to read Chinese.
I am crawling a site in Hong Kong, so it is essential to be able to read Chinese.
The site:
http://www1.hkej.com/dailynews/headline/article/1105148/IMF%E5%82%B3%E4%BF%83%E4%B8%AD%E5%9C%8B%E9%80%80%E5%87%BA%E6%95%91%E5%B8%82
As long as it belongs to 'dailynews', that's what I want.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
import items


class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    allowed_domains = ["hkej.com"]
    login_page = 'http://www.hkej.com/template/registration/jsp/login.jsp'
    start_urls = 'http://www.hkej.com/dailynews'

    rules = (
        Rule(LinkExtractor(allow=('dailynews', ), unique=True), callback='parse_item', follow=True),
    )

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # name column
    def login(self, response):
        return FormRequest.from_response(response,
                                         formdata={'name': 'users', 'password': 'my password'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "username" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            return Request(url=self.start_urls)
        else:
            self.log("\n\n\nYou are not logged in.\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens

    def parse_item(self, response):
        hxs = Selector(response)
        news = hxs.xpath("//div[@class='news']")
        images = hxs.xpath('//p')

        for image in images:
            allimages = items.HKejImage()
            allimages['image'] = image.xpath('a/img[not(@data-original)]/@src').extract()
            yield allimages

        for new in news:
            allnews = items.HKejItem()
            allnews['news_title'] = new.xpath('h2/@text()').extract()
            allnews['news_url'] = new.xpath('h2/@href').extract()
            yield allnews
Thank you very much and I would appreciate any help!
First, to set settings, do it in the settings.py file, or specify the custom_settings attribute on the spider, like:
custom_settings = {
    'DEPTH_LIMIT': 3,
}
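For example, attached to the spider class it would look roughly like this (the value 3 is just illustrative; tune it to how deep the topic pages go):
class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    custom_settings = {
        'DEPTH_LIMIT': 3,  # illustrative value
    }
    # ... the rest of the spider stays as posted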
Then, you have to make sure the spider is actually reaching the parse_item method (which I don't think it is, though I haven't tested it yet). Also, you can't specify both the callback and follow parameters on a rule, because they don't work together.
First remove the follow from your rule, or add another rule to decide which links to follow and which links to parse into items, as sketched below.
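A sketch of the two-rule variant, with one rule that only parses article pages into items and one that only follows listing pages (the allow patterns are guesses based on the example article URL in the question, so adjust them to the real URL structure; the more specific rule comes first, because the first rule that matches a link wins):
rules = (
    # parse individual article pages as items, without following them further
    Rule(LinkExtractor(allow=('dailynews/headline/article', ), unique=True), callback='parse_item'),
    # follow other dailynews pages (topic listings) without parsing them as items
    Rule(LinkExtractor(allow=('dailynews', ), unique=True), follow=True),
)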
Second, in your parse_item method you are using incorrect XPath expressions. To get all the images, you could use something like:
images = hxs.xpath('//img')
and then, to get the image URL:
allimages['image'] = image.xpath('./@src').extract()
For the news, it looks like this could work:
allnews['news_title'] = new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/@href').extract()
Now, as I understand your problem, this isn't a LinkExtractor duplication bug, but rather a matter of imprecise rule specifications. Also make sure your XPath expressions are valid, since your question didn't mention that they needed correcting.
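Putting the corrected selectors together, parse_item could look roughly like this (still a sketch; HKejImage and HKejItem are the item classes from the question's items module):
def parse_item(self, response):
    for image in response.xpath('//img'):
        allimages = items.HKejImage()
        allimages['image'] = image.xpath('./@src').extract()
        yield allimages

    for new in response.xpath("//div[@class='news']"):
        allnews = items.HKejItem()
        allnews['news_title'] = new.xpath('.//a/text()').extract()
        allnews['news_url'] = new.xpath('.//a/@href').extract()
        yield allnews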
I'm using Scrapy to scrape a website. The item pages that I want to scrape look like http://www.somepage.com/itempage/&page=x, where x is any number from 1 to 100. Thus, I have an SgmlLinkExtractor Rule with a callback function specified for any page matching this pattern.
The website does not have a list page with all the items, so I want to somehow tell Scrapy to scrape those URLs (from 1 to 100). This guy here seemed to have the same issue, but couldn't figure it out.
Does anyone have a solution?
You could list all the known URLs in your Spider class' start_urls attribute:
class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    start_urls = ['http://www.somepage.com/itempage/&page=%s' % page for page in xrange(1, 101)]

    def parse(self, response):
        # ...
If it's just a one-time thing, you can create a local HTML file (file:///c:/somefile.html) with all the links, start scraping that file, and add somepage.com to the allowed domains.
Alternately, in the parse function, you can return a new Request for the next URL to be scraped.
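A sketch of that second approach, with start_urls reduced to the page-1 URL and the crawl stopping at page 100 (Request comes from scrapy.http; the item extraction itself is left out):
def parse(self, response):
    # ... extract the item fields from this page here ...

    # queue the next page until we reach page 100
    current_page = int(response.url.rsplit('=', 1)[1])
    if current_page < 100:
        next_url = 'http://www.somepage.com/itempage/&page=%s' % (current_page + 1)
        yield Request(next_url, callback=self.parse)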