Having trouble with finviz screener tool using Scrapy - python

Looked over a similar question that involves the same site, but it looks like my problem has different css/html to it. Read through the scrapy tutorial, but still having trouble getting the code to print.
import scrapy
class finvizSpider(scrapy.Spider):
name = "finviz"
start_urls = [
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]
def parse(self, response):
data = response.xpath('//div[#id="screener-content"]/div/table/tbody').extract()
print(data)
Any help would be much appreciated. Thanks
EDIT:
The answer below allows me to run just one url in the scrapy shell and it switches the original url to the main page without the filters added on.
I am still unable to run this in python. I will attach the output.

There are two issues in your code:
First, Scrapy response does not contains <tbody> elements because it downloads the original HTML page. While, what you see in the browser is the modified HTML which adds <tbody> elements to tables.
Second, you have added an extra div element, //div[#id="screener-content"]/div/table/tbody (talking about bold one).
Try this XPath, '//div[#id="screener-content"]/table//tr/td/table//tr/td//text()' or just execute the below modified code.
Code
import scrapy
class finvizSpider(scrapy.Spider):
name = "finviz"
start_urls = [
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
"https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]
def parse(self, response):
tickers = response.xpath('//a[#class="screener-link-primary"]/text()').extract()
print(tickers)
Output Screenshot

Related

Python - Scrapy - Navigating through a website

I’m trying to use Scrapy to log into a website, then navigate within than website, and eventually download data from it. Currently I’m stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I’m logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse 2, I see that the navigation bar at the top of the website has gone one level deeper.
The main problem is now in the "parse3" part as I cannot navigate another level deeper (or maybe I can, but the "open_in_browser" does not open the website any more - only if I put it after parse or parse 2). My understanding is that I put multiple "parse-functions" after another to navigate through the website.
Datacamp says I always need to start with a "start request function" which is what I tried but within the YouTube videos, etc. I saw evidence that most start directly with parse functions. Using "inspect" on the website for parse 3, I see that this time href is a relative link and I used different methods (See source 5) to navigate to it as I thought this might be the source of error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess
class LoginNeedScraper(scrapy.Spider):
name = "login"
start_urls = ["<some website>"]
def parse(self, response):
loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/#value').extract_first()
execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/#value').extract_first()
return FormRequest.from_response(response, formdata={
'loginTicket': loginTicket,
'execution': execution,
'username': '<someusername>',
'password': '<somepassword>'},
callback=self.parse2)
def parse2(self, response):
next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/#href').extract_first()
yield scrapy.Request(url=next_page_url, callback=self.parse3)
def parse3(self, response):
next_page_url_2 = response.xpath('/html//div[#class = "headerPanel"]/div[3]/a/#href').extract_first()
absolute_url = response.urljoin(next_page_url_2)
yield scrapy.Request(url=absolute_url, callback=self.start_scraping)
def start_scraping(self, response):
open_in_browser(response)
process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to scrape a website completely. Let's say you want to crawl all links in the header of the website and then open that link in order to see the main page to which that link was referring.
In order to achieve this, firstly identify what you need to scrape and mark CSS or XPath selectors for those links and put them in a rule. Every rule has a default callback to parse or you can also assign it to some other method. I am attaching a dummy example of creating rules, and you can map it accordingly to your case:
rules = (
Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)

How to: Get a Python Scrapy to run a simple xpath retrieval

I'm very new to python and am trying to build a script that will eventually extract page titles and s from specified URLs to a .csv in the format I specify.
I have tried managed to get the spider to work in CMD using :
response.xpath("/html/head/title/text()").get()
So the xpath must be right.
Unfortunately when I run the the file my spider is in it never seems to work properly. I think the issue is in the final block of code, unfortunately all the guides I follow seem to use CSS. I feel more comfortable with xpath because you can simply copy,paste it from Dev Tools.
import scrapy
class PageSpider(scrapy.Spider):
name = "dorothy"
start_urls = [
"http://www.example.com",
"http://www.example.com/blog"]
def parse(self, response):
for title in response.xpath("/html/head/title/text()"):
yield {
"title": sel.xpath("Title a::text").extract_first()
}
I expected when that to give me the page title of the above URLs.
First of all, your second url on self.start_urls is invalid and returning 404, so you will end up with only one title extracted.
Second, you need to read more about selectors, you extracted the title on your test on shell but got confused when using it on your spider.
Scrapy will call the parse method for each url on self.start_urls, so you don't need to iterate trough titles, you only have one per page.
You also can access the <title> tag directly using // at the beginning of your xpath expression, see this text copied from W3Schools :
/ Selects from the root node
// Selects nodes in the document from the current node that match the selection no matter where they are
Here is the fixed code:
import scrapy
class PageSpider(scrapy.Spider):
name = "dorothy"
start_urls = [
"http://www.example.com"
]
def parse(self, response):
yield {
"title": response.xpath('//title/text()').extract_first()
}

How to scrape data from multiple pages using Scrapy?

I'm trying to scrape data from multiple pages using Scrapy. I'musing the code below, what am I doing wrong?
import scrapy
class CollegeSpider(scrapy.Spider):
name = 'college'
allowed_domains = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha']
start_urls = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha/']
def parse(self,response):
for college in response.css('div.title'):
if college.css('a::text').extract_first():
yield {'college_name':college.css('a::text').extract_first()}
next_page_url=response.css('li.page-next>a::attr(href)').extract_first()
next_page_url=response.urljoin(next_page_url)
yield scrapy.Request(url=next_page_url,callback=self.praise)
Why do you think you are doing something wrong? Does it show any error? If so, the output should be included in the question in the first place. If it's not doing what you expected, again, you should tell us.
Anyway, looking at the code, there are at least two possible errors:
allowed_domains should be just a domain name, not full URL, as documented.
when you yield new Request to the next page, you should give callback=self.parse instead of self.praise to process the response the same way as the first URL

Using scrapy I get an empty item

I want to crawl some information from a webpage using python and scrapy, but when I try to do it the output of my item is empty...
First of all I've started a new project with scrapy. Then I've written the following in the items.py file:
import scrapy
class KakerlakeItem(scrapy.Item):
info=scrapy.Field()
pass
Next, I've created a new file in the spider's folder with the following code:
import scrapy
from kakerlake.items import KakerlakeItem
class Kakerlakespider(scrapy.Spider):
name='Coco'
allowed_domains=['http://www.goeuro.es/']
start_urls=['http://www.goeuro.es/search/NTYzY2U2Njk4YzA1ZDoyNzE2OTU4ODM=']
def parse(self, response):
item=KakerlakeItem()
item['info']=response.xpath('//span[#class= "inline-b height-100"]/text()').extract()
#yield item
return item
I expect, by writing scrapy crawl Coco -o data.json in the console, that I will get what I want, but instead of this I obtain the json file with {'info': []}. That is, an empty item.
I've tried a lot of things and I don't know why it doesn't work correctly...
Your xpath is invalid for the page, as there isn't a single class with "inline-b" or "height-100". This page is heavily modified via Javascript, so what you see in a browser will not be representative of what Scrapy receives.
xpath results:
>>> response.xpath('//span[contains(#class, "inline-b")]')
[]
>>> response.xpath('//span[contains(#class, "height-100")]')
[]
Remove the pass in KakerlakeItem(scrapy.Item) ?

Scrapy spider not showing whole result

Hi all I an trying to get whole results from the given link in the code. but my code not giving all results. This link says it contain 2132 results but it returns only 20 results.:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import Flipkart
class Test(Spider):
name = "flip"
allowed_domains = ["flipkart.com"]
start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io& otracker=ch_vn_mobile_filter_Mobile%20Brands_All"
]
def parse(self, response):
sel = Selector(response)
sites = sel.xpath('//div[#class="pu-details lastUnit"]')
items = []
for site in sites:
item = Flipkart()
item['title'] = site.xpath('div[1]/a/text()').extract()
items.append(item)
return items**
That is because the site only shows 20 results at a time, and loading of more results is done with JavaScript when the user scrolls to the bottom of the page.
You have two options here:
Find a link on the site which shows all results on a single page (doubtful it exists, but some sites may do so when passed an optional query string, for example).
Handle JavaScript events in your spider. The default Scrapy downloader doesn't do this, so you can either analyze the JS code and send the event signals yourself programmatically or use something like Selenium w/ PhantomJS to let the browser deal with it. I'd recommend the latter since it's more fail-proof than the manual approach of interpreting the JS yourself. See this question for more information, and Google around, there's plenty of information on this topic.

Categories