I am running a scraper on a set of domains. The list of domains is provided externally, and a test run with Scrapy made clear that for some websites the http prefix is not correctly specified. Some websites throw a DNS error when you try to navigate to http://www.example.com instead of http://example.com.
I tried to address this with a for loop that generates, for each domain, the combinations with the most common prefixes (http://www., http://, https://, https://www.). Yet I found that for some websites this results in the site being scraped twice (with entirely duplicate content), which is not only inefficient on my side but also not in line with web etiquette.
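A simplified sketch of the kind of loop I mean (illustrative, not my exact code):

# Illustrative only: builds every prefix variant for every domain,
# which is what leads to the duplicate crawls described above.
domains = ['example.com', 'www.example.org']  # externally provided list
prefixes = ['http://www.', 'http://', 'https://', 'https://www.']

start_urls = []
for domain in domains:
    # strip an existing leading www. so the prefixes don't double it up
    bare = domain[4:] if domain.startswith('www.') else domain
    for prefix in prefixes:
        start_urls.append(prefix + bare)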
I have two questions:
Is this how Scrapy should behave?
How can I prevent this from happening?
I have seen How check if website support http, htts and www prefix with scrapy, but it feels like a detour; it should be part of Scrapy itself.
I am quite new to Scrapy but am designing a web scraper to pull certain information from GoFundMe, specifically in this case the number of people who have donated to a project. I have written an XPath expression which works fine in Chrome but returns null in Scrapy.
A random example project is https://www.gofundme.com/f/passage/donations, which at present has 22 donations. The expression below, when entered in Chrome's inspector, gives me "Donations(22)", which is what I need:
//h2[@class="heading-5 mb0"]/text()
However, in my Scrapy spider the following yields null:
import scrapy

class DonationsSpider(scrapy.Spider):
    name = 'get_donations'
    start_urls = [
        'https://www.gofundme.com/f/passage/donations'
    ]

    def parse(self, response):
        amount_of_donations = response.xpath('//h2[@class="heading-5 mb0"]/text()').extract_first()
        yield {
            'Donations': amount_of_donations
        }
Does anyone know why Scrapy is unable to see this value?
I am doing this in an attempt to find out how many times the rest of the spider needs to loop, as when I hard code this value it works with no problems and yields all of the donations.
Well, that is because many requests are made to fulfil the page at "https://www.gofundme.com/f/passage/donations". Your Chrome browser is smart enough to understand JavaScript: it executes the JavaScript code and fetches responses from several different endpoints to render the page.
Scrapy does not execute JavaScript, so it never makes those follow-up requests. One of them goes to the endpoint "https://gateway.gofundme.com/web-gateway/v1/feed/passage/counts", which loads the data you are looking for, and trying to replicate the whole browser behaviour in your Python script is not recommended anyway.
Instead you can call that API directly and you will get the data. The good news is that the endpoint responds with JSON, which is well structured and easy to parse.
I am sure you are also looking for the data that comes from this endpoint: "https://gateway.gofundme.com/web-gateway/v1/feed/passage/donations?limit=20&offset=0&sort=recent"
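As a rough illustration of calling the counts endpoint directly with Scrapy (a sketch only; the key names inside the JSON payload are not shown in this thread, so inspect the response before relying on specific fields):

import json
import scrapy

class DonationCountsSpider(scrapy.Spider):
    name = 'get_donation_counts'
    # The endpoint discussed above; it returns JSON rather than HTML.
    start_urls = [
        'https://gateway.gofundme.com/web-gateway/v1/feed/passage/counts'
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # Inspect `data` to find the exact key that holds the donation count.
        yield {'counts': data}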
For more information you may refer to one of my blog posts.
I really want to know how to find all websites under a certain URL.
For example, I have a URL of https://a.b/c, and I want to find all websites under it, such as https://a.b/c/d and https://a.b/c/d/e.
Are there any methods to do this?
Thanks so much!
If the pages are interconnected with hyperlinks from the page at the root, you can easily spider the site by following internal links. This would require you to load the root page, parse its hyperlinks, load those pages and repeat until no new links are detected. You will need to implement cycle detection to avoid crawling pages you have already crawled. Spiders are not trivial to operate politely; many sites expose metadata through robots.txt files or otherwise to indicate which parts of their site they do not wish to be indexed, and a polite spider may need to operate slowly to avoid consuming excessive server resources. You should respect these norms.
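A minimal sketch of such a spider using Scrapy, assuming the root URL https://a.b/c from the question; Scrapy's built-in duplicate filter provides the cycle detection mentioned above, and the ROBOTSTXT_OBEY setting covers robots.txt:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SectionSpider(CrawlSpider):
    name = 'section'
    allowed_domains = ['a.b']
    start_urls = ['https://a.b/c']
    custom_settings = {'ROBOTSTXT_OBEY': True}  # respect robots.txt, as noted above

    rules = (
        # Follow only links whose URL stays under /c and record every page found.
        Rule(LinkExtractor(allow=r'https://a\.b/c($|/)'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        yield {'url': response.url}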
However, do note that there is no general-purpose way to enumerate all pages if they are not explicitly linked from the site. To do so would require:
that the site enables directory listing, so you can identify all files stored on those paths. Most sites do not provide such a service; or
cooperation with the operator of the site or the web server to find all pages listed under those paths; or
a brute-force search of all possible URLs under those paths, which is an effectively unbounded set. Implementing such a search would not be polite to the operator of the site, is prohibitive in terms of time and effort, and cannot be exhaustive.
Along with @Cosmic Ossifrage's suggestion, you could look for a sitemap. It is often referenced in the robots.txt found at the root (https://www.example.com/robots.txt). That might have a link to a sitemap XML file with a list of links on the site, which may or may not be exhaustive.
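A small sketch of that idea using requests (the URL is just the example root above; real robots.txt files may list several Sitemap: directives or none at all):

import requests

robots = requests.get('https://www.example.com/robots.txt', timeout=10).text

# Collect every "Sitemap:" directive; each one points to a sitemap XML
# document that lists URLs on the site.
sitemap_urls = [line.split(':', 1)[1].strip()
                for line in robots.splitlines()
                if line.lower().startswith('sitemap:')]
print(sitemap_urls)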
Use Xenu's Link Sleuth, WebCheck or DRKSpider.
Here are the links:
Link Sleuth : http://home.snafu.de/tilman/xenulink.html
WebCheck : https://arthurdejong.org/webcheck/
DRKSpider : http://www.drk.com.ar/spider.php
I have generated a URL from a product code, like:
code: 2555-525
url: www.example.com/2555-525.png
But when fetching the URL, the file might have a different name format on the server, like:
www.example.com/2555-525.png
www.example.com/2555-525_TEXT.png
www.example.com/2555-525_TEXT_TEXT.png
Sample code:
urllib2.urlopen(URL).read()
Could we pass the URL like www.example.com/2555-525*.png?
Using wildcards in URLs is useless in most cases because
the interpretation of the part of the URL after http://www.example.com/ is totally up to the server, so http://www.example.com/2555-525*.png might have a meaning to the server but probably does not;
normally (exceptions like WebDAV exist) there is no way of listing resources in a collection, or existing URLs in general, apart from trying them one by one (which is impractical) or scraping a known site for URLs (which might be incomplete).
For finding and downloading URLs automatically you can use a Web Crawler or Spider.
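Since wildcards will not work, one workable sketch is to try the known name patterns one by one and keep the first that exists; this uses urllib2 to match the sample code in the question, and the candidate patterns are just the ones listed there:

import urllib2

code = '2555-525'
# Candidate name patterns taken from the question; extend as needed.
candidates = [
    'http://www.example.com/%s.png' % code,
    'http://www.example.com/%s_TEXT.png' % code,
    'http://www.example.com/%s_TEXT_TEXT.png' % code,
]

image_data = None
for url in candidates:
    try:
        image_data = urllib2.urlopen(url).read()
        break  # stop at the first pattern that exists on the server
    except urllib2.URLError:
        continue  # e.g. 404 -- try the next pattern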
I'd like to know if it is possible to browse all links in a site (including the parent links and sublinks) using Python Selenium (example: yahoo.com):
fetch all links on the homepage,
open each one of them,
open all the links in the sublinks, down to three or four levels.
I'm using Selenium with Python.
Thanks
Ala'a
You want "web-scraping" software like Scrapy and possibly Beautifulsoup4 - the first is used to build a program called a "spider" which "crawls" through web pages, extracting structured data from them, and following certain (or all) links in them. BS4 is also for extracting data from web pages, and combined with libraries like requests can be used to build your own spider, though at this point something like Scrapy is probably more relevant to what you need.
There are numerous tutorials and examples out there to help you; just start with a Google search for them.
Sure it is possible, but you have to instruct Selenium to enter these links one by one, as you are working within one browser.
If the pages do not rely on JavaScript to render their links in the browser, it would be much more efficient to fetch these pages with direct HTTP requests and process them that way. In this case I would recommend using requests. However, even with requests it is up to your code to locate all URLs in the page and follow up by fetching those pages.
There might also be other Python packages which specialize in this kind of task, but I cannot speak from real experience here.
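A minimal sketch of the requests-based approach, assuming the links are present in the served HTML (i.e. not rendered by JavaScript) and using BeautifulSoup, mentioned in another answer, to parse them; the depth limit mirrors the three to four levels asked about:

from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=3):
    seen = set()
    frontier = [(start_url, 0)]
    domain = urlparse(start_url).netloc
    while frontier:
        url, depth = frontier.pop()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
            link = urljoin(url, a['href'])
            if urlparse(link).netloc == domain:  # stay on the same site
                frontier.append((link, depth + 1))
    return seen

print(crawl('https://www.example.com', max_depth=3))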
I just got Scrapy set up and running and it works great, but I have two (noob) questions. I should say first that I am totally new to Scrapy and spidering sites.
Can you limit the number of links crawled? I have a site that doesn't use pagination and just lists a lot of links (which I crawl) on their home page. I feel bad crawling all of those links when I really just need to crawl the first 10 or so.
How do you run multiple spiders at once? Right now I am using the command scrapy crawl example.com, but I also have spiders for example2.com and example3.com. I would like to run all of my spiders using one command. Is this possible?
For #1: Don't use the rules attribute to extract and follow links; write your own logic in the parse function and yield or return Request objects.
For #2: Try scrapyd.
Credit goes to Shane, here: https://groups.google.com/forum/?fromgroups#!topic/scrapy-users/EyG_jcyLYmU
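For #1, a minimal sketch of what that might look like with a reasonably recent Scrapy (the link selector and domain are placeholders, not anything specific to your site):

import scrapy

class LimitedSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Extract links yourself instead of using a CrawlSpider rules attribute,
        # and keep only the first 10.
        links = response.xpath('//a/@href').extract()[:10]
        for href in links:
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        yield {'url': response.url}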
Using the CloseSpider extension should allow you to specify limits of this sort.
http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.contrib.closespider
Haven't tried it yet since I didn't need it. Looks like you might also have to enable it as an extension (see the top of the same page) in your settings file.
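A sketch of the relevant settings (the module path below follows the docs linked above; newer Scrapy versions use scrapy.extensions.closespider.CloseSpider and may have it enabled by default, in which case setting CLOSESPIDER_PAGECOUNT alone is enough):

# settings.py (sketch): stop the spider after roughly 10 pages.
EXTENSIONS = {
    'scrapy.contrib.closespider.CloseSpider': 500,
}
CLOSESPIDER_PAGECOUNT = 10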