I looked over a similar question involving the same site, but my problem seems to involve different CSS/HTML. I have read through the Scrapy tutorial, but I'm still having trouble getting the code to print anything.
import scrapy


class finvizSpider(scrapy.Spider):
    name = "finviz"
    start_urls = [
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]

    def parse(self, response):
        data = response.xpath('//div[@id="screener-content"]/div/table/tbody').extract()
        print(data)
Any help would be much appreciated. Thanks
EDIT:
The answer below allows me to run just one URL in the Scrapy shell, and it switches the original URL to the main page without the filters applied.
I am still unable to run this in Python. I will attach the output.
There are two issues in your code:
First, the Scrapy response does not contain <tbody> elements, because Scrapy downloads the original HTML page. What you see in the browser is the modified HTML, to which the browser adds <tbody> elements inside tables.
Second, you have added an extra div element in //div[@id="screener-content"]/div/table/tbody (the stray /div between the screener div and the table).
Try this XPath, '//div[@id="screener-content"]/table//tr/td/table//tr/td//text()', or just run the modified code below.
Code
import scrapy


class finvizSpider(scrapy.Spider):
    name = "finviz"
    start_urls = [
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-change",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=21",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=41",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=61",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=81",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=101",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=121",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=141",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=161",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=181",
        "https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=-ticker&r=201"]

    def parse(self, response):
        tickers = response.xpath('//a[@class="screener-link-primary"]/text()').extract()
        print(tickers)
(Output screenshot)
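If the goal is to collect the tickers rather than just print them, a small variation on the parse method above (just a sketch, not tested against the live site) is to yield one item per ticker and let a Scrapy feed export write them out:

    def parse(self, response):
        for ticker in response.xpath('//a[@class="screener-link-primary"]/text()').extract():
            # one dict per ticker so the feed exporter can write a row for it
            yield {'ticker': ticker}

Running scrapy crawl finviz -o tickers.csv should then produce a CSV with one ticker per row.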
I have been assigned to create a crawler using Python and Scrapy to get the reviews of a specific hotel. I have read quite a number of tutorials and guides, but my code still just generates an empty CSV file.
Item.py
import scrapy


class AgodaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    StarRating = scrapy.Field()
    Title = scrapy.Field()
    Comments = scrapy.Field()
Agoda_reviews.py
import scrapy


class AgodaReviewsSpider(scrapy.Spider):
    name = 'agoda_reviews'
    allowed_domains = ['agoda.com']
    start_urls = ['https://www.agoda.com/holiday-inn-express-kuala-lumpur-city-centre/hotel/kuala-lumpur-my.html?checkIn=2020-04-14&los=1&adults=2&rooms=1&searchrequestid=41af11cc-eaa6-42cc-874d-383761d3523c&travellerType=1&tspTypes=9']

    def parse(self, response):
        StarRating = response.xpath('//span[@class="Review-comment-leftScore"]/span/text()').extract()
        Title = response.xpath('//span[@class="Review-comment-bodyTitle"]/span/text()').extract()
        Comments = response.xpath('//span[@class="Review-comment-bodyText"]/span/text()').extract()

        count = 0
        for item in zip(StarRating, Title, Comments):
            # create a dictionary to store the scraped info
            scraped_data = {
                'StarRating': item[0],
                'Title': item[1],
                'Comments': item[2],
            }
            # yield or give the scraped info to scrapy
            yield scraped_data
Can anybody please kindly let me know where the problems are? I am totally clueless...
Your results are empty because scrapy is receiving a response that does not have a lot of content. You can see this by starting a scrapy shell from your terminal and sending a request to the page you are trying to crawl.
scrapy shell 'https://www.agoda.com/holiday-inn-express-kuala-lumpur-city-centre/hotel/kuala-lumpur-my.html?checkIn=2020-04-14&los=1&adults=2&rooms=1&searchrequestid=41af11cc-eaa6-42cc-874d-383761d3523c&travellerType=1&tspTypes=9'
Then you can view the response that scrapy received by running:
view(response)
That should open the response that was received and stored by scrapy in your browser. As you should see, there are no reviews to extract from.
Also, as you are trying to extract some information from span-elements, you can run response.css('span').extract() and you will see that there are some span-elements in the response but none of them has a class that has anything to do with Reviews.
So to sum up, Agoda is sending you a nearly empty response, and as a consequence Scrapy extracts empty lists. Possible reasons: Agoda has figured out that you are trying to crawl their website (for example, based on your user agent) and is hiding the content from you, or the content is generated with JavaScript.
To solve your problem, you could either use the Agoda API, make yourself familiar with user-agent spoofing, or check out the Selenium package, which can help with JavaScript-heavy websites.
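If it turns out to be user-agent based blocking, a minimal sketch of user-agent spoofing in Scrapy is to override custom_settings on the spider (the user-agent string below is only an illustrative placeholder, and this will not help if the reviews are rendered with JavaScript):

import scrapy


class AgodaReviewsSpider(scrapy.Spider):
    name = 'agoda_reviews'
    allowed_domains = ['agoda.com']
    # pretend to be a regular browser instead of Scrapy's default user agent
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36',
    }

The rest of the spider (start_urls and parse) would stay as in the question.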
For a classification project I need the raw HTML content of roughly 1000 websites. I only need the landing page and nothing more, so the crawler does not have to follow links. I want to use Scrapy for it, but I can't get the code together. Because I read in the documentation that JSON files are first stored in memory and then saved (which can cause problems when crawling a large number of pages), I want to save the output in the '.jl' (JSON Lines) format. I use the Anaconda prompt to execute my code.
I want the resulting file to have two columns, one with the domain name and the second with the raw HTML content of each site:
domain, html_raw
..., ...
..., ...
I found many spider examples, but I can't figure out how to put everything together. This is how far I got :(
Start the Project:
scrapy startproject dragonball
The actual spider (which might be completely wrong):
import scrapy


class DragonSpider(scrapy.Spider):
    name = "dragonball"

    def start_requests(self):
        urls = [
            'https://www.faz.de',
            'https://www.spiegel.de',
            'https://www.stern.de',
            'https://www.brandeins.de',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ???
I navigate to the dragonball folder and run the spider with:
scrapy crawl dragonball -o dragonball.jl
Any help would be appreciated :)
I advise you to store the HTML in files and write the file names to a CSV; it will be easier than keeping the data in the domain, html_raw format (see the sketch below).
You can write the files with a plain with open('%s.html' % domain, 'wb') as f: f.write(response.body), or download them with the Files pipeline; check the documentation here: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
You can get the domain with:
from urllib.parse import urlparse
domain = urlparse(response.url).netloc
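Putting those two pieces together, one possible parse method for the spider above (only a sketch; naming each file after its domain is just one convention) could look like this:

from urllib.parse import urlparse


def parse(self, response):
    domain = urlparse(response.url).netloc
    filename = '%s.html' % domain
    # store the raw HTML in its own file, named after the domain
    with open(filename, 'wb') as f:
        f.write(response.body)
    # yield only a small record (domain, file name) for the feed export
    yield {'domain': domain, 'html_raw': filename}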
If you really want to store everything in a single file, then you can use the following (including part of vezunchik's answer):
from urllib.parse import urlparse


def parse(self, response):
    yield {
        'domain': urlparse(response.url).netloc,
        'html_raw': response.body.decode('utf-8'),
    }
As mentioned, this isn't a good idea in the long run as you'll end up with a huge file.
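With either parse method in place, running the command from the question, scrapy crawl dragonball -o dragonball.jl, should write one JSON line per crawled page, which is the format the question was aiming for and sidesteps the large-JSON-file concern mentioned above.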
I am new to Scrapy and Python, so I am a beginner. I want Scrapy to read a text file with a seed list of around 100k URLs, visit each URL, extract all external URLs (URLs of other sites) found on each of those seed URLs, and export the results to a separate text file.
Scrapy should only visit the URLs in the text file, not spider out and follow any other URL.
I want Scrapy to work as fast as possible; I have a very powerful server with a 1 Gbps line. Each URL in my list is from a unique domain, so I won't be hitting any one site hard and thus won't run into IP blocks.
How would I go about creating a Scrapy project that extracts all external links from a list of URLs stored in a text file?
Thanks.
You should use:
1. the start_requests method to read the list of URLs, and
2. a CSS or XPath selector to get all "a" HTML elements.
from scrapy import Request, Spider


class YourSpider(Spider):
    name = "your_spider"

    def start_requests(self):
        with open('your_input.txt', 'r') as f:  # read the list of urls
            for url in f.readlines():  # process each of them
                yield Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # YourItem is your Item subclass with parent_url and child_urls fields
        # (a sketch of it is shown after the links below)
        item = YourItem(parent_url=response.url)
        item['child_urls'] = response.css('a::attr(href)').extract()
        return item
More info about start_requests here:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
To export the scraped items to another file, use an item pipeline or a feed export. A basic pipeline example is here:
http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file
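For completeness, here is a sketch of the YourItem class used above, plus a standard setting that affects crawl speed (the value is only illustrative; tune it for your server):

# items.py
import scrapy


class YourItem(scrapy.Item):
    parent_url = scrapy.Field()
    child_urls = scrapy.Field()


# settings.py (illustrative value; the Scrapy default is 16)
CONCURRENT_REQUESTS = 256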
I am new to Python and Scrapy and am trying to work through a small example, but I am having some problems!
I am only able to crawl the first given URL; I cannot crawl more than one page, let alone an entire website!
Please help me or give me some advice on how I can crawl an entire website or more pages in general...
The example I am doing is very simple...
My items.py
import scrapy


class WikiItem(scrapy.Item):
    title = scrapy.Field()
my wikip.py (the spider)
import scrapy
from wiki.items import WikiItem


class CrawlSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org/wiki/"]
    start_urls = (
        'http://en.wikipedia.org/wiki/Portal:Arts',
    )

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = WikiItem()
            item['title'] = sel.xpath('//h1[@id="firstHeading"]/text()').extract()
            yield item
When I run scrapy crawl wikip -o data.csv in the root project directory, the result is:
title
Portal:Arts
Can anyone give me insight as to why it is not following urls and crawling deeper?
I have checked some related SO questions, but they have not helped me solve the issue.
scrapy.Spider is the simplest spider. Rename your class from CrawlSpider, since CrawlSpider is one of Scrapy's generic spiders.
One of the options below can be used:
e.g. 1. class WikiSpider(scrapy.Spider)
or 2. class WikiSpider(CrawlSpider)
If you use the first option, you need to write the logic yourself for following the links you want to follow on that page.
For the second option you can do the following:
After the start URLs you need to define the rules, as below:
rules = (
    Rule(LinkExtractor(allow=('https://en.wikipedia.org/wiki/Portal:Arts\?.*?')), callback='parse_item', follow=True,),
)
Also, please rename the function defined as "parse" if you use CrawlSpider: CrawlSpider uses the parse method internally to implement its logic, so overriding parse here is why the crawl spider doesn't work.
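Putting the second option together, here is a sketch of what the CrawlSpider version could look like (the allow pattern is broadened to all /wiki/ pages and allowed_domains is reduced to the bare domain; both are assumptions you may want to adjust):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from wiki.items import WikiItem


class WikiSpider(CrawlSpider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org"]  # domains only, no path
    start_urls = ['https://en.wikipedia.org/wiki/Portal:Arts']

    rules = (
        # follow links to other wiki pages and send each response to parse_item
        Rule(LinkExtractor(allow=(r'https://en\.wikipedia\.org/wiki/.+',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = WikiItem()
        item['title'] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item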