How to quickly check if domain exists? [duplicate] - python

This question already has an answer here:
How to reliably check if a domain has been registered or is available?
(1 answer)
Closed 2 years ago.
I have a large list of domains and I need to check if domains are available now. I do it like this:
import requests

list_domain = ['google.com', 'facebook.com']

for domain in list_domain:
    result = requests.get(f'http://{domain}', timeout=10)
    if result.status_code == 200:
        print(f'Domain {domain} [+++]')
    else:
        print(f'Domain {domain} [---]')
But the check is too slow. Is there a way to make it faster? Maybe someone knows an alternative method for checking domains for existence?

You can use the socket library to determine if a domain has a DNS entry:
>>> import socket
>>>
>>> addr = socket.gethostbyname('google.com')
>>> addr
'74.125.193.100'
>>> socket.gethostbyname('googl42652267e.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
>>>
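Applied to a whole list, a minimal sketch (reusing the list_domain variable from the question) could look like this:
import socket

list_domain = ['google.com', 'facebook.com']

for domain in list_domain:
    try:
        # gethostbyname raises socket.gaierror when no DNS record is found
        socket.gethostbyname(domain)
    except socket.gaierror:
        print(f'Domain {domain} [---]')
    else:
        print(f'Domain {domain} [+++]')
Keep in mind that a DNS lookup only tells you whether the name currently resolves; a registered domain without an A record will still look "free" with this check.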

If you want to check which domains are available, the more correct approach is to catch the ConnectionError from the requests module: even if you get a response code other than 200, the fact that there is a response at all means there is a server associated with that domain, and hence the domain is taken.
This is not foolproof for checking domain availability, because a domain might be taken but have no appropriate A record associated with it, or the server may just be down for the time being.
The code below also runs the checks concurrently, using a thread pool.
from concurrent.futures import ThreadPoolExecutor

import requests
from requests.exceptions import ConnectionError


def validate_existence(domain):
    try:
        response = requests.get(f'http://{domain}', timeout=10)
    except ConnectionError:
        print(f'Domain {domain} [---]')
    else:
        print(f'Domain {domain} [+++]')


list_domain = ['google.com', 'facebook.com', 'nonexistent_domain.test']

with ThreadPoolExecutor() as executor:
    executor.map(validate_existence, list_domain)

You can do that via the "requests-futures" module.
requests-futures runs the requests asynchronously; with an average internet connection it can check roughly 8-10 URLs per second (based on my experience).
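A minimal sketch of that approach, assuming the list_domain from the question (the worker count is just an illustrative default):
from concurrent.futures import as_completed

from requests.exceptions import ConnectionError
from requests_futures.sessions import FuturesSession

list_domain = ['google.com', 'facebook.com']

session = FuturesSession(max_workers=10)
futures = {session.get(f'http://{domain}', timeout=10): domain for domain in list_domain}

for future in as_completed(futures):
    domain = futures[future]
    try:
        future.result()  # re-raises the request's exception, e.g. ConnectionError
    except ConnectionError:
        print(f'Domain {domain} [---]')
    else:
        print(f'Domain {domain} [+++]')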

What you can do is run the script multiple times, but give each run only a limited number of domains, to keep each run speedy; a chunking sketch is shown below.
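For example, a minimal sketch of splitting the list into chunks (the list and chunk size are placeholders) so that each run processes one chunk:
list_domain = ['google.com', 'facebook.com', 'example.org', 'example.net']
chunk_size = 2  # arbitrary; tune to your needs

chunks = [list_domain[i:i + chunk_size] for i in range(0, len(list_domain), chunk_size)]
for number, chunk in enumerate(chunks):
    # each chunk could be written to its own file and fed to a separate run of the script
    print(number, chunk)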

Use Scrapy: it is way faster, and by default it yields only 200 responses until you override that, so in your case follow these steps:
pip install scrapy
After installing, in your project folder use the terminal to create a project:
scrapy startproject projectname projectdir
It will create a folder named projectdir.
Now:
cd projectdir
Inside projectdir enter:
scrapy genspider mydomain mydomain.com
Now navigate to the spiders folder and open mydomain.py.
Now add a few lines of code:
import scrapy


class MydomainSpider(scrapy.Spider):
    name = "mydomain"

    def start_requests(self):
        # note: Scrapy requires absolute URLs including the scheme
        urls = [
            'http://facebook.com',
            'http://google.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {'Available_Domains': response.url}
Now go back to projectdir and run
scrapy crawl mydomain -o output.csv
You will have all the working domains (those returning status code 200) in the output.csv file.

Related

How do I make scrapy work on Google Functions using Flask & Scrapy

Hiya I've been following this tutorial:
https://weautomate.org/articles/running-scrapy-spider-cloud-function/
I'm trying to get this web scraper to run in the cloud and be able to receive POST requests containing a postcode, which will trigger a search for that postcode and return a list of addresses in the response.
Currently I just have this main.py file and a requirements.txt file with scrapy and flask in it.
from multiprocessing import Process, Queue

from flask import Flask, jsonify, request
import scrapy
from scrapy.crawler import CrawlerProcess

app = Flask(__name__)


@app.route('/start_scrape', methods=['POST'])
def start_scrape(request):
    postcode = request.get_json()['postcode']
    start_urls = [f'https://find-energy-certificate.service.gov.uk/find-a-certificate/search-by-postcode?postcode={postcode}']
    addresses = []

    class AddressesSpider(scrapy.Spider):
        name = 'Addresses'
        allowed_domains = ['find-energy-certificate.service.gov.uk']
        start_urls = start_urls

        def parse(self, response):
            for row in response.xpath('//table[@class="govuk-table"]//tr'):
                address = row.xpath("normalize-space(.//a[@class='govuk-link']/text())").extract()[0].lower()
                address = address.rsplit(',', 2)[0]
                link = row.xpath('.//a[@class="govuk-link"]/@href').extract()
                details = row.xpath("normalize-space(.//td/following-sibling::td)").extract()
                item = {
                    'link': link,
                    'details': details,
                    'address': address
                }
                addresses.append(item)

    process = scrapy.crawler.CrawlerProcess(settings={
        'ROBOTSTXT_OBEY': False
    })
    process.crawl(AddressesSpider)
    process.start()

    return jsonify(addresses)


def my_cloud_function(event, context):
    def script(queue):
        try:
            settings = scrapy.settings.Settings()
            settings.setdict({
                'ROBOTSTXT_OBEY': False
            })

            process = CrawlerProcess(settings)
            process.crawl(AddressesSpider)
            process.start()
            queue.put(None)
        except Exception as e:
            queue.put(e)

    queue = Queue()
    main_process = Process(target=script, args=(queue,))
    main_process.start()
    main_process.join()

    result = queue.get()
    if result is not None:
        raise result

    return 'ok'
I'm getting a few errors from this script when it first launches, but it does build successfully:
TypeError: start_scrape() takes 0 positional arguments but 1 was given
.view_func ( /layers/google.python.pip/pip/lib/python3.10/site-packages/functions_framework/init.py:99 )
2
MissingTargetException: File /workspace/main.py is expected to contain a function named /start_scrape
.get_user_function ( /layers/google.python.pip/pip/lib/python3.10/site-packages/functions_framework/_function_registry.py:41 )
1
NameError: name 'start_urls' is not defined
.AddressesSpider ( /workspace/main.py:16 )
1
ModuleNotFoundError: No module named 'scrapy'
. ( /workspace/main.py:3 )
When sending a curl post with -H "Authorization: bearer $(gcloud auth print-identity-token)"
-H "Content-Type: application/json"
-d '{"postcode": "OX4+1EU"}'
I get a 500 error. Any help to fix this issue would be great.
I've been trying to run this Scrapy spider, and I was expecting it to return a list of addresses in JSON format when I send a postcode. Currently it seems to do nothing.
In your code, you create the Scrapy process, you start it, and then you return the jsonify HTTP response.
Therefore, the Scrapy process runs in the background -> that's your problem.
Indeed, with Cloud Functions (and Cloud Run and App Engine standard) you are charged only while an HTTP request is being processed. Outside of request processing, because you are not charged, the CPU is throttled. After 15 minutes (with Cloud Run; I think it's 30 minutes with Cloud Functions 1st gen) without any request processing, the instance is offloaded.
And so, you can't run anything (or only very slowly) in the background. If it's too slow, the instance is removed before the end of the scrape.
To solve that, you can use Cloud Run (or Cloud Functions 2nd generation, which is the same thing) with the minimum-instances option, or the no-CPU-throttling option if the processing takes less than 15 minutes.
Of course, more CPU usage incurs additional fees!

I want to create a Scrapy app with Tornado, where the user can enter the URL to search and get the result in a UI

I am completely new to Python; I only learned Scrapy a few days ago.
I want to combine Scrapy with Tornado or some other Python setup, where the user can enter the URL to crawl and get the result in a UI.
I tried scrapyrt, where the user gets the result as JSON in the UI, but I wasn't able to use the JSON.
I also tried arachnado:
https://github.com/TeamHG-Memex/arachnado
But it's an old project which is no longer supported; when I tried it, it threw lots of errors.
I also tried the Bitbucket project from this thread: https://groups.google.com/forum/#!topic/python-tornado/vi7idvzOgU8
It's throwing an error too. Can someone please help by providing detailed steps to implement this?
Given you just recently learned Python + Scrapy, it would be best for you to learn more about Python because embedding Scrapy in a web server leverages some Python magic. On the other hand, this kind of question pops up from time to time and it should probably be addressed in some form.
What you have to remember is that Scrapy is built using the Twisted framework so you have to make sure whatever web framework you're using has Twisted integration. Luckily for you, Tornado integrates well with Twisted.
import tornado.platform.twisted
tornado.platform.twisted.install()

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging as scrapy_log_conf
from tornado import ioloop, web
from tornado.log import enable_pretty_logging as enable_tornado_log


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = []

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


class RESTfulInterface(web.RequestHandler):
    def post(self):
        # get url to crawl from user
        crawl_urls = self.get_arguments('crawl_url')
        if len(crawl_urls) == 0:
            self.send_error(400)
            raise web.Finish

        crawl_runner = CrawlerRunner()
        # deferred = crawl_runner.crawl(QuotesSpider, start_urls=['http://quotes.toscrape.com/tag/humor/'])
        deferred = crawl_runner.crawl(QuotesSpider, start_urls=crawl_urls)
        deferred.addBoth(self.crawl_complete)

    def crawl_complete(self, result):
        """
        Do something meaningful after the crawl is complete, like sending an
        email to all the admins.
        """
        print('CRAWL_COMPLETE')


def main():
    enable_tornado_log()
    scrapy_log_conf({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    app = web.Application([
        (r'/crawl/?', RESTfulInterface),
    ])
    app.listen(port='8888')
    ioloop.IOLoop.current().start()


main()
This snippet will scrape pages from http://quotes.toscrape.com. The crawl is initiated when you POST to /crawl endpoint.
`curl -X POST -F 'crawl_url=http://quotes.toscrape.com/tag/humor/' -F 'crawl_url=hello' http://localhost:8888/crawl`

for loop skipping over code?

So I'm trying to execute this code:
liner = 0

for eachLine in content:
    print(content[liner].rstrip())
    raw = str(content[liner].rstrip())
    print("Your site:" + raw)
    Sitecheck = requests.get(raw)
    time.sleep(5)
    var = Sitecheck.text.find('oht945t945iutjkfgiutrhguih4w5t45u9ghdgdirfgh')
    time.sleep(5)
    print(raw)
    liner += 1
I would expect this to run from the first print down to the liner increment and then go back up; however, something else seems to happen:
https://google.com
Your site:https://google.com
https://google.com
https://youtube.com
Your site:https://youtube.com
https://youtube.com
https://x.com
Your site:https://x.com
This happens before the GET requests, and the GET requests later just time out:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
I tried adding time.sleep(5) to my code to make it run more smoothly, however this failed to yield results.
Why don't you use Python's exception handling to catch failed connections?
import requests

# list with websites
content = ["https://google.com", "https://stackoverflow.com/", "https://bbc.co.uk/", "https://this.site.doesnt.exi.st"]

# get list index and element
for liner, eachLine in enumerate(content):
    # not sure why this line exists, probably necessary for your content list
    raw = str(eachLine.rstrip())
    # try to get a connection and give feedback, if successful
    try:
        Sitecheck = requests.get(raw)
        print("Tested site #{0}: site {1} responded".format(liner, raw))
    except:
        print("Tested site #{0}: site {1} seems to be down".format(liner, raw))
Mind you, there are more elaborate ways in Python, like Scrapy or BeautifulSoup, to retrieve web content. But I think that your question is more of a conceptual one than a practical one.

python nose and twisted

I am writing a test for a function that downloads data from a URL with Twisted (I know about twisted.web.client.getPage, but this one adds some extra functionality). Either way, I want to use nosetests, since I am using it throughout the project and it doesn't seem appropriate to use Twisted Trial only for this particular test.
So what I am trying to do is something like:
from nose.twistedtools import deferred

@deferred()
def test_download(self):
    url = 'http://localhost:8000'
    d = getPage(url)

    def callback(data):
        assert len(data) != 0

    d.addCallback(callback)
    return d
A test server listens on localhost:8000. The issue is I always get twisted.internet.error.DNSLookupError:
DNSLookupError: DNS lookup failed: address 'localhost:8000' not found: [Errno -5] No address associated with hostname.
Is there a way I can fix this? Does anyone actually uses nose.twistedtools?
Update: A more complete traceback
Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/nose-0.11.2-py2.6.egg/nose/twistedtools.py", line 138, in errback
    failure.raiseException()
  File "/usr/local/lib/python2.6/dist-packages/Twisted-9.0.0-py2.6-linux-x86_64.egg/twisted/python/failure.py", line 326, in raiseException
    raise self.type, self.value, self.tb
DNSLookupError: DNS lookup failed: address 'localhost:8000' not found: [Errno -5] No address associated with hostname.
Update 2
My bad, it seems in the implementation of getPage, I was doing something like:
obj = urlparse.urlparse(url)
netloc = obj.netloc
and passing netloc to the factory, when I should've passed netloc.split(':')[0].
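For illustration, a minimal sketch with the standard library (urllib.parse in Python 3, urlparse in Python 2) showing that the hostname attribute already drops the port:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

parts = urlparse('http://localhost:8000')
print(parts.netloc)    # 'localhost:8000' - includes the port
print(parts.hostname)  # 'localhost'      - host only, suitable for the DNS lookup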
Are you sure your getPage function is parsing the URL correctly? The error message seems to suggest that it is using the hostname and port together when doing the dns lookup.
You say your getPage is similar to twisted.web.client.getPage, but that works fine for me when I use it in this complete script:
#!/usr/bin/env python

from nose.twistedtools import deferred
from twisted.web import client
import nose

@deferred()
def test_download():
    url = 'http://localhost:8000'
    d = client.getPage(url)

    def callback(data):
        assert len(data) != 0

    d.addCallback(callback)
    return d

if __name__ == "__main__":
    args = ['--verbosity=2', __file__]
    nose.run(argv=args)
While running a simple http server in my home directory:
$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
The nose test gives the following output:
.
----------------------------------------------------------------------
Ran 1 test in 0.019s
OK

Python Package For Multi-Threaded Spider w/ Proxy Support?

Instead of just using urllib does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through http proxies? I know of a few such as Twisted, Scrapy, libcurl etc. but I don't know enough about them to make a decision or even if they can use proxies.. Anyone know of the best one for my purposes? Thanks!
It's simple to implement this in Python:
The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.
# -*- coding: utf-8 -*-
import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()
queue = Queue()

def get_parser(host, root, charset):

    def parse():
        try:
            while True:
                url = queue.get_nowait()
                try:
                    content = urlopen(url).read().decode(charset)
                except UnicodeDecodeError:
                    continue
                for link in BeautifulSoup(content).findAll('a'):
                    try:
                        href = link['href']
                    except KeyError:
                        continue
                    if not href.startswith('http://'):
                        href = 'http://%s%s' % (host, href)
                    if not href.startswith('http://%s%s' % (host, root)):
                        continue
                    if href not in visited:
                        visited.add(href)
                        queue.put(href)
                        print href
        except Empty:
            pass

    return parse

if __name__ == '__main__':
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    queue.put('http://%s%s' % (host, root))
    workers = []
    for i in range(5):
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
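As a side note, for a more modern take, a minimal Python 3 sketch of multi-threaded downloading through an explicit HTTP proxy with requests could look like this (the proxy address and URL list are placeholders):
from concurrent.futures import ThreadPoolExecutor

import requests

# placeholder proxy address and URLs - replace with your own
PROXIES = {'http': 'http://127.0.0.1:3128', 'https': 'http://127.0.0.1:3128'}
URLS = ['http://example.com/', 'http://example.org/']

def fetch(url):
    # each request is routed through the configured proxy
    response = requests.get(url, proxies=PROXIES, timeout=10)
    return url, response.status_code, len(response.content)

with ThreadPoolExecutor(max_workers=5) as executor:
    for url, status, size in executor.map(fetch, URLS):
        print(url, status, size)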
Usually proxies filter websites categorically, based on how the website is classified. It is difficult to transmit data through proxies for blocked categories; e.g. YouTube is classified as audio/video streams, therefore YouTube is blocked in some places, especially schools.
If you want to bypass proxies and get the data off a website, you can put it on your own genuine website, like a dot-com website that is registered to you.
When you are making and registering the website, categorise your website as anything you want.
