How to: Get a Python Scrapy spider to run a simple XPath retrieval - python

I'm very new to Python and am trying to build a script that will eventually extract page titles (and eventually other fields) from specified URLs to a .csv in the format I specify.
I have managed to get the spider to work in CMD using:
response.xpath("/html/head/title/text()").get()
So the xpath must be right.
Unfortunately, when I run the file my spider is in, it never seems to work properly. I think the issue is in the final block of code; unfortunately, all the guides I follow seem to use CSS. I feel more comfortable with XPath because you can simply copy/paste it from Dev Tools.
import scrapy

class PageSpider(scrapy.Spider):
    name = "dorothy"
    start_urls = [
        "http://www.example.com",
        "http://www.example.com/blog"]

    def parse(self, response):
        for title in response.xpath("/html/head/title/text()"):
            yield {
                "title": sel.xpath("Title a::text").extract_first()
            }
I expected that to give me the page titles of the above URLs.

First of all, your second URL in self.start_urls is invalid and returns a 404, so you will end up with only one title extracted.
Second, you need to read more about selectors: you extracted the title in your shell test but got confused when using it in your spider.
Scrapy will call the parse method for each URL in self.start_urls, so you don't need to iterate through titles; you only have one per page.
You can also access the <title> tag directly by using // at the beginning of your XPath expression; see this text copied from W3Schools:
/ Selects from the root node
// Selects nodes in the document from the current node that match the selection no matter where they are
Here is the fixed code:
import scrapy

class PageSpider(scrapy.Spider):
    name = "dorothy"
    start_urls = [
        "http://www.example.com"
    ]

    def parse(self, response):
        yield {
            "title": response.xpath('//title/text()').extract_first()
        }
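Since you eventually want a .csv, you don't need to write the CSV handling yourself. As a rough sketch (assuming the spider is saved in a file called dorothy.py, which is an assumed filename), Scrapy's feed exports can produce it when you run the spider from the command line:

    scrapy runspider dorothy.py -o titles.csv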

Related

Having trouble with finviz screener tool using Scrapy

I looked over a similar question that involves the same site, but it looks like my problem involves different CSS/HTML. I read through the Scrapy tutorial, but am still having trouble getting the code to print.
import scrapy

class finvizSpider(scrapy.Spider):
    name = "finviz"
    start_urls = [
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]

    def parse(self, response):
        data = response.xpath('//div[@id="screener-content"]/div/table/tbody').extract()
        print(data)
Any help would be much appreciated. Thanks
EDIT:
The answer below allows me to run just one URL in the Scrapy shell, and it switches the original URL to the main page without the filters added on.
I am still unable to run this in Python. I will attach the output.
There are two issues in your code:
First, the Scrapy response does not contain <tbody> elements, because Scrapy downloads the original HTML page. What you see in the browser is the modified HTML, in which the browser adds <tbody> elements to tables.
Second, you have added an extra div element in //div[@id="screener-content"]/div/table/tbody (the /div right after the id predicate).
Try this XPath, '//div[@id="screener-content"]/table//tr/td/table//tr/td//text()', or just execute the modified code below.
Code
import scrapy

class finvizSpider(scrapy.Spider):
    name = "finviz"
    start_urls = [
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]

    def parse(self, response):
        tickers = response.xpath('//a[@class="screener-link-primary"]/text()').extract()
        print(tickers)
Output Screenshot
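If you want to sanity-check a selector before running the full spider, a rough approach is to load the first start URL in the Scrapy shell and try the expression there (only a sketch; the site may still require the usual anti-bot settings):

    scrapy shell "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change"
    >>> response.xpath('//a[@class="screener-link-primary"]/text()').extract()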

How to extract C code from the HTML using scrapy?

I am using Scrapy from the Anaconda command line to extract C code from a GitHub page (HTML). It is in the form shown in a screenshot (not reproduced here).
I need to extract the data as it appears on the left side of the image.
I used XPath to extract the required code:
import scrapy

class TestCCodeSpider(scrapy.Spider):
    name = 'test_c_code'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/gouravthakur39/beginners-C-program-examples/blob/master/AllTempScalesConv.c/']
    custom_settings = {'FEED_URI': "test_c.csv",
                       'FEED_FORMAT': 'csv'}

    def parse(self, response):
        print("processing: " + response.url)
        notation = response.xpath("//table[@class='highlight tab-size js-file-line-container']/tr/td[@class='blob-code blob-code-inner js-file-line']/text()").extract()
        text_td = response.xpath("//table[@class='highlight tab-size js-file-line-container']/tr/td[@class='blob-code blob-code-inner js-file-line']/span/text()").extract()
        row_data = zip(notation, text_td)

        for i in row_data:
            scrapped_info = {
                'notation': i[0],
                'text': i[1],
            }
            yield scrapped_info
When I run the XPaths individually, they give the right results. For example,
response.xpath("//table[@class='highlight tab-size js-file-line-container']/tr/td[@class='blob-code blob-code-inner js-file-line']/text()").extract()
returns output (screenshot omitted), and so does the other extraction:
response.xpath("//table[@class='highlight tab-size js-file-line-container']/tr/td[@class='blob-code blob-code-inner js-file-line']/span/text()").extract()
Can anybody guide me on how I can extract the complete code from the HTML without any distortion in the C code?
Thanks in advance.
You can achieve this in two ways:
A. Use this XPath instead: normalize-space(//div[@itemprop='text']). This gave me the desired result.
B. Crawl the following URL instead:
https://raw.githubusercontent.com/gouravthakur39/beginners-C-program-examples/master/AllTempScalesConv.c/. I haven't checked the XPath for this, though.
Hope this helps!
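As a rough sketch of option B (the spider name is illustrative, and the trailing slash from the URL above is dropped on the assumption that the raw file path does not take one): the raw endpoint serves plain text, so the whole C source is simply the response body.

import scrapy

class RawCCodeSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'raw_c_code'
    start_urls = ['https://raw.githubusercontent.com/gouravthakur39/beginners-C-program-examples/master/AllTempScalesConv.c']

    def parse(self, response):
        # the raw file is served as plain text, so no XPath is needed
        yield {'code': response.text}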

Python - Scrapy - Navigating through a website

I'm trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I'm stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own:
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I'm logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" does not open the website any more - only if I put it after parse or parse2). My understanding is that I put multiple "parse" functions one after another to navigate through the website.
DataCamp says I always need to start with a "start_requests" function, which is what I tried, but in YouTube videos, etc., I saw evidence that most spiders start directly with parse functions. Using "Inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5 above) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess

class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)

process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to scrape a website completely. Let's say you want to crawl all links in the header of the website and then open each link in order to see the page it refers to.
To achieve this, first identify what you need to scrape, work out CSS or XPath selectors for those links, and put them in a rule. Every rule defaults to the parse callback, or you can assign it to some other method. I am attaching a dummy example of creating rules, which you can map to your case:
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)
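For a fuller picture, here is a minimal, self-contained sketch of a CrawlSpider built around such rules; the spider name, start URL, and CSS selectors are placeholders rather than values taken from the question:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NavigationSpider(CrawlSpider):
    # illustrative names only
    name = 'navigation'
    start_urls = ['https://example.com']

    rules = (
        # follow links found in the site header; no callback, just crawl deeper
        Rule(LinkExtractor(restrict_css=['nav a'])),
        # for links inside the header panel, parse the target page
        Rule(LinkExtractor(restrict_css=['.headerPanel a']), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}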

Capture Google Search Term and ResultStats with Scrapy

I have built a very simple scraper using Scrapy. For the output table, I would like to show the Google News search term as well as the Google resultstats value.
The information I would like to capture appears in the source of the Google page as
<input class="gsfi" value="Elon Musk">
and
<div id="resultStats">About 52,300 results</div>
I have already tried to capture both via ('input.value::text') and ('id.resultstats::text'), which did not work, however. Does anyone have an idea how to solve this?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2015%2Ccd_max%3A12%2F31%2F2015&tbm=nws']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = {
                'search_title': quote.css('input.value::text').extract(),
                'results': quote.css('id.resultstats::text').extract(),
            }
            yield item
The page renders differently when you access it with Scrapy.
The search field becomes:
response.css('input#sbhost::attr(value)').get()
The results count is:
response.css('#resultStats::text').get()
Also, there is no quote class on that page.
You can test this in the scrapy shell:
scrapy shell -s ROBOTSTXT_OBEY=False "https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2015%2Ccd_max%3A12%2F31%2F2015&tbm=nws"
And then run those 2 commands.
[EDIT]
If your goal is to get one item for each URL, then you can do this:
def parse(self, response):
    item = {
        'search_title': response.css('input#sbhost::attr(value)').get(),
        'results': response.css('#resultStats::text').get(),
    }
    yield item
If your goal is to extract every result on the page, then you need something different.
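For that case, a rough sketch would loop over each result container; note that the div.g container and the inner selectors are assumptions about the news results markup, not values verified against the page Scrapy receives:

def parse(self, response):
    # 'div.g' is an assumed container for each news result
    for result in response.css('div.g'):
        yield {
            'title': result.css('h3::text').get(),
            'link': result.css('a::attr(href)').get(),
        }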

Scraping some Facebook data but not all? Scrapy/Splash/Python

I have a spider that looks like this:
import scrapy
from scrapy_splash import SplashRequest

class BarkbotSpider(scrapy.Spider):
    name = 'barkbot'
    start_urls = [
        'http://www.facebook.com/pg/TheBarkFL/events/?ref=page_internal/'
    ]
    custom_settings = {
        'FEED_URI': 'output/barkoutput.json'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
            )

    def parse(self, response):
        for href in response.css("div#upcoming_events_card a::attr(href)").extract():
            yield response.follow(href, self.parse_concert)

    def parse_concert(self, response):
        concert = {
            "headliner": response.xpath(
                "//h1[@id='seo_h1_tag']/text()"
            ).extract_first(),
            "venue": "The Bark",
            "venue_address": "507 All Saints St.",
            "venue_website": "https://www.facebook.com/TheBarkFL",
            "date_time": response.xpath(
                "//li[@id='event_time_info']//text()"
            ).extract(),
            "notes": response.xpath(
                "//div[@data-testid='event-permalink-details']/span/text()"
            ).extract()
        }
        if concert['headliner']:
            yield concert
I run the spider and it finishes successfully, but the "notes" and "date_time" keys are returning nothing but empty lists. I'm especially confused about the notes one, as that seems fairly straightforward unless xpath can't use data-testid as an attribute. I am, however, getting the headliner key successfully scraped, so I'm obviously connecting to each page.
I'm new to scraping JavaScript-generated content and thus Splash, but I've managed to get one other spider working successfully, just not on Facebook. What gives?
unless xpath can't use data-testid as an attribute
No, that's not it; I just checked with Scrapy 1.5.1 and your xpath matched a sample document fine. It even matched the other data-testid attributes in that document, so I am pretty sure you've hit a race condition, because event-permalink-details does not appear in the HTML; it's loaded from an XHR call to their graphql endpoint. That may be fine in your case thanks to Splash, but if your selector isn't matching, then that selector is running before the XHR has resolved. I don't know enough Splash to help troubleshoot that situation.
I don't know the answer to your date_time question, but I actually bet what you really want is .xpath('//li[@id="event_time_info"]//@content'), because that contains 2019-01-03T17:30:00-08:00 to 2019-01-03T20:30:00-08:00, which seems much nicer than the blob of strings the unqualified text() matches.
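If the race condition is indeed the culprit, one common mitigation is to ask Splash to wait before returning the rendered page. This is only a sketch (not verified against Facebook, and the 5-second value is an arbitrary assumption), replacing the start_requests method from the question:

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(
            url,
            self.parse,
            # give the page time for its XHR calls to resolve before rendering
            args={'wait': 5},
        )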
