How to scrape data from multiple pages using Scrapy? - python

I'm trying to scrape data from multiple pages using Scrapy. I'musing the code below, what am I doing wrong?
import scrapy
class CollegeSpider(scrapy.Spider):
name = 'college'
allowed_domains = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha']
start_urls = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha/']
def parse(self,response):
for college in response.css('div.title'):
if college.css('a::text').extract_first():
yield {'college_name':college.css('a::text').extract_first()}
next_page_url=response.css('li.page-next>a::attr(href)').extract_first()
next_page_url=response.urljoin(next_page_url)
yield scrapy.Request(url=next_page_url,callback=self.praise)

Why do you think you are doing something wrong? Does it show any error? If so, the output should be included in the question in the first place. If it's not doing what you expected, again, you should tell us.
Anyway, looking at the code, there are at least two possible errors:
allowed_domains should be just a domain name, not full URL, as documented.
when you yield new Request to the next page, you should give callback=self.parse instead of self.praise to process the response the same way as the first URL

Related

Having trouble with finviz screener tool using Scrapy

Looked over a similar question that involves the same site, but it looks like my problem has different css/html to it. Read through the scrapy tutorial, but still having trouble getting the code to print.
import scrapy
class finvizSpider(scrapy.Spider):
name = "finviz"
start_urls = [
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]
def parse(self, response):
data = response.xpath('//div[#id="screener-content"]/div/table/tbody').extract()
print(data)
Any help would be much appreciated. Thanks
EDIT:
The answer below allows me to run just one url in the scrapy shell and it switches the original url to the main page without the filters added on.
I am still unable to run this in python. I will attach the output.
There are two issues in your code:
First, Scrapy response does not contains <tbody> elements because it downloads the original HTML page. While, what you see in the browser is the modified HTML which adds <tbody> elements to tables.
Second, you have added an extra div element, //div[#id="screener-content"]/div/table/tbody (talking about bold one).
Try this XPath, '//div[#id="screener-content"]/table//tr/td/table//tr/td//text()' or just execute the below modified code.
Code
import scrapy
class finvizSpider(scrapy.Spider):
name = "finviz"
start_urls = [
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]
def parse(self, response):
tickers = response.xpath('//a[#class="screener-link-primary"]/text()').extract()
print(tickers)
Output Screenshot

extract_first only resulting in first item only .extract() does not work

I hope that you're all well. I am trying to learn Python through web-scraping. My project at the moment is to scrape data from a games store. I am initially wanting to follow a product link and print the response from each link. There are 60 game links on the page that I wish for Scrapy to follow. Below is the code.
import scrapy
class GameSpider(scrapy.Spider):
name = 'spider'
allowed_domains = ['365games.co.uk']
start_urls = ['https://www.365games.co.uk/3ds-games/']
def parse(self, response):
all_games = response.xpath('//*[#id="product_grid"]')
for game in all_games:
game_url = game.xpath('.//h3/a/#href').extract_first()
yield scrapy.Request(game_url, callback=self.parse_game)
def parse_game(self, response):
print(response.status)
When I run this code scrapy runs and goes through the first link and prints the response, but stops. When I change the code to .extract() I get the following,
TypeError: Request url must be str or unicode, got list
The same applies with .get()/.getall() being that .get() only returns the first and .getall() displays the above error.
Any help would be greatly appreciated, but please be gentle I am trying to learn.
Thanks in advance and best regards,
Gav
The error is saying that you are passing a list instead of a string to scrapy.Request. This tells us that game_url is actually a list, when you want a string. You are very close to the right thing here, but I believe your problem is that you are looping in the wrong place. You first XPath returns just a single item, rather than a list of items. It is within this that you want to find your game_urls leading to
def parse(self, response):
product_grid = response.xpath('//*[#id="product_grid"]')
for game_url in product_grid.xpath('.//h3/a/#href').getall():
yield scrapy.Request(game_url, callback=self.parse_game)
You could also combine your xpath queries to directly to
def parse(self, response):
all_games = response.xpath('//*[#id="product_grid"]//h3/a/#href')
for game_url in all_games.getall():
yield scrapy.Request(game_url, callback=self.parse_game)
In this case you could also use follow instead of creating a new Request directly. You can even directly pass a selector rather than a string to follow so you don't need to getall(), and it knows how to deal with <a> elements so you don't need the #href either!
def parse(self, response):
for game_url in response.xpath('//*[#id="product_grid"]//h3/a'):
yield response.follow(game_url, callback=self.parse_game)

attribute error during recursive scraping with scrapy

I have a scrapy spider that works well as long as I give it a page that contains the links to the pages that it should scrape.
Now I want to not give it all the categories but the page that contains links to all categories.
I thought I could simply add another parse function in order to achive this.
but the console output gives me an attribute error
"attributeError: 'zaubersonder' object has no attribute
'parsedetails'"
This tells me that some attribute refference is not working correctly.
I am new to object orientation but I thought scarpy is calling parse which is calling prase_level2 wich in turn calls parse_details and this should work fine.
below is my effort so far.
import scrapy
class zaubersonder(scrapy.Spider):
name = 'zaubersonder'
allowed_domains = ['abc.de']
start_urls = ['http://www.abc.de/index.php/rgergegregre.html'
]
def parse(self, response):
urls = response.css('a.ulSubMenu::attr(href)').extract() # links to categories
for url in urls:
url = response.urljoin(url)
yield scrapy.Request(url=url,callback=self.parse_level2)
def parse_level2(self, response):
urls2 = response.css('a.ulSubMenu::attr(href)').extract() # links to entries
for url2 in urls2:
url2 = response.urljoin(url2)
yield scrapy.Request(url=url2,callback=self.parse_details)
def parse_details(self,response): #extract entries
yield {
"Titel": response.css("li.active.last::text").extract(),
"Content": response.css('div.ce_text.first.block').extract() + response.css('div.ce_text.last.block').extract(),
}
edit: fixed the code in case someone will search for it
There is a typo in the code. The callback in parse_level2 is self.parsedetails, but the function is named parse_details.
Just change the yield in parse_level2 to:
yield scrapy.Request(url=url2,callback=self.parse_details)
..and it should work better.

Scrapy Linkextractor duplicating(?)

I have the crawler implemented as below.
It is working and it would go through sites regulated under the link extractor.
Basically what I am trying to do is to extract information from different places in the page:
- href and text() under the class 'news' ( if exists)
- image url under the class 'think block' ( if exists)
I have three problems for my scrapy:
1) duplicating linkextractor
It seems that it will duplicate processed page. ( I check against the export file and found that the same ~.img appeared many times while it is hardly possible)
And the fact is , for every page in the website, there are hyperlinks at the bottom that facilitate users to direct to the topic they are interested in, while my objective is to extract information from the topic's page ( here listed several passages's title under the same topic ) and the images found within a passage's page( you can arrive to the passage's page by clicking on the passage's title found at topic page).
I suspect link extractor would loop the same page over again in this case.
( maybe solve with depth_limit?)
2) Improving parse_item
I think it is quite not efficient for parse_item. How could I improve it? I need to extract information from different places in the web ( for sure it only extracts if it exists).Beside, it looks like that the parse_item could only progress HkejImage but not HkejItem (again I checked with the output file). How should I tackle this?
3) I need the spiders to be able to read Chinese.
I am crawling a site in HK and it would be essential to be capable to read Chinese.
The site:
http://www1.hkej.com/dailynews/headline/article/1105148/IMF%E5%82%B3%E4%BF%83%E4%B8%AD%E5%9C%8B%E9%80%80%E5%87%BA%E6%95%91%E5%B8%82
As long as it belongs to 'dailynews', that's the thing I want.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
import items
class EconjournalSpider(CrawlSpider):
name = "econJournal"
allowed_domains = ["hkej.com"]
login_page = 'http://www.hkej.com/template/registration/jsp/login.jsp'
start_urls = 'http://www.hkej.com/dailynews'
rules=(Rule(LinkExtractor(allow=('dailynews', ),unique=True), callback='parse_item', follow =True),
)
def start_requests(self):
yield Request(
url=self.login_page,
callback=self.login,
dont_filter=True
)
# name column
def login(self, response):
return FormRequest.from_response(response,
formdata={'name': 'users', 'password': 'my password'},
callback=self.check_login_response)
def check_login_response(self, response):
"""Check the response returned by a login request to see if we are
successfully logged in.
"""
if "username" in response.body:
self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
return Request(url=self.start_urls)
else:
self.log("\n\n\nYou are not logged in.\n\n\n")
# Something went wrong, we couldn't log in, so nothing happens
def parse_item(self, response):
hxs = Selector(response)
news=hxs.xpath("//div[#class='news']")
images=hxs.xpath('//p')
for image in images:
allimages=items.HKejImage()
allimages['image'] = image.xpath('a/img[not(#data-original)]/#src').extract()
yield allimages
for new in news:
allnews = items.HKejItem()
allnews['news_title']=new.xpath('h2/#text()').extract()
allnews['news_url'] = new.xpath('h2/#href').extract()
yield allnews
Thank you very much and I would appreciate any help!
First, to set settings, make it on the settings.py file or you can specify the custom_settings parameter on the spider, like:
custom_settings = {
'DEPTH_LIMIT': 3,
}
Then, you have to make sure the spider is reaching the parse_item method (which I think it doesn't, haven't tested yet). And also you can't specify the callback and follow parameters on a rule, because they don't work together.
First remove the follow on your rule, or add another rule, to check which links to follow, and which links to return as items.
Second on your parse_item method, you are getting incorrect xpath, to get all the images, maybe you could use something like:
images=hxs.xpath('//img')
and then to get the image url:
allimages['image'] = image.xpath('./#src').extract()
for the news, it looks like this could work:
allnews['news_title']=new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/#href').extract()
Now, as and understand your problem, this isn't a Linkextractor duplicating error, but only poor rules specifications, also make sure you have valid xpath, because your question didn't indicate you needed xpath correction.

Scrapy - no list page, but I know the url for each item page

I'm using Scrapy to scrape a website. The item page that I want to scrape looks like: http://www.somepage.com/itempage/&page=x. Where x is any number from 1 to 100. Thus, I have an SgmlLinkExractor Rule with a callback function specified for any page resembling this.
The website does not have a listpage with all the items, so I want to somehow well scrapy to scrape those urls (from 1 to 100). This guy here seemed to have the same issue, but couldn't figure it out.
Does anyone have a solution?
You could list all the known URLs in your Spider class' start_urls attribute:
class SomepageSpider(BaseSpider):
name = 'somepage.com'
allowed_domains = ['somepage.com']
start_urls = ['http://www.somepage.com/itempage/&page=%s' % page for page in xrange(1, 101)]
def parse(self, response):
# ...
If it's just a one time thing, you can create a local html file file:///c:/somefile.html with all the links. Start scraping that file and add somepage.com to allowed domains.
Alternately, in the parse function, you can return a new Request which is the next url to be scraped.

Categories