So I'm trying to execute this code:
liner = 0
for eachLine in content:
    print(content[liner].rstrip())
    raw = str(content[liner].rstrip())
    print("Your site:" + raw)
    Sitecheck = requests.get(raw)
    time.sleep(5)
    var = Sitecheck.text.find('oht945t945iutjkfgiutrhguih4w5t45u9ghdgdirfgh')
    time.sleep(5)
    print(raw)
    liner += 1
I would expect this to run from the first print down to the liner increment and then go back to the top of the loop; however, something else seems to happen:
https://google.com
Your site:https://google.com
https://google.com
https://youtube.com
Your site:https://youtube.com
https://youtube.com
https://x.com
Your site:https://x.com
This output appears before the get requests go through, and the get requests later just time out:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
I tried adding time.sleep(5) calls to my code to make it run more smoothly, but this failed to yield results.
Why don't you use Python's exception handling to catch failed connections?
import requests

# list of websites
content = ["https://google.com", "https://stackoverflow.com/", "https://bbc.co.uk/", "https://this.site.doesnt.exi.st"]

# get list index and element
for liner, eachLine in enumerate(content):
    # not sure why this line exists; probably necessary for your content list
    raw = str(eachLine.rstrip())
    # try to get a connection and give feedback if successful
    try:
        Sitecheck = requests.get(raw)
        print("Tested site #{0}: site {1} responded".format(liner, raw))
    except requests.exceptions.RequestException:
        print("Tested site #{0}: site {1} seems to be down".format(liner, raw))
Mind you, there are more elaborate Python tools for retrieving web content, such as Scrapy or BeautifulSoup. But I think your question is more conceptual than practical.
Related
I am trying to create a simple HTTP server that uses the Python HTTPServer which inherits BaseHTTPServer: https://github.com/python/cpython/blob/main/Lib/http/server.py
There are numerous examples of this approach online and I don't believe I am doing anything unusual.
I am simply importing the classes via "from http.server import HTTPServer, BaseHTTPRequestHandler" in my code.
My code overrides the do_GET() method to parse the path variable to determine what page to show.
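For context, the handler is shaped roughly like the following (a simplified sketch rather than my exact code; the routes and page bodies are placeholders):

from http.server import HTTPServer, BaseHTTPRequestHandler

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # route on self.path; the pages here are placeholders
        if self.path == "/":
            body = b"<html><body><a href='/other'>other page</a></body></html>"
        else:
            body = b"<html><body>another page</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 50000), MyHandler).serve_forever()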
However, if I start this server and connect to it locally (e.g. http://127.0.0.1:50000), the first page loads fine. If I navigate to another page (via links on my first page), that too works fine. On occasion, though (and this is somewhat sporadic), there is a delay, and the server log shows a Request timed out: timeout('timed out') error. I have tracked this down to the handle_one_request method in the BaseHTTPRequestHandler class:
def handle_one_request(self):
    """Handle a single HTTP request.

    You normally don't need to override this method; see the class
    __doc__ string for information on how to handle specific HTTP
    commands such as GET and POST.

    """
    try:
        self.raw_requestline = self.rfile.readline(65537)
        if len(self.raw_requestline) > 65536:
            self.requestline = ''
            self.request_version = ''
            self.command = ''
            self.send_error(HTTPStatus.REQUEST_URI_TOO_LONG)
            return
        if not self.raw_requestline:
            self.close_connection = True
            return
        if not self.parse_request():
            # An error code has been sent, just exit
            return
        mname = 'do_' + self.command  ## the name of the method is created
        if not hasattr(self, mname):  ## checking that we have that method defined
            self.send_error(
                HTTPStatus.NOT_IMPLEMENTED,
                "Unsupported method (%r)" % self.command)
            return
        method = getattr(self, mname)  ## getting that method
        method()  ## finally calling it
        self.wfile.flush()  # actually send the response if not already done.
    except socket.timeout as e:
        # a read or a write timed out. Discard this connection
        self.log_error("Request timed out: %r", e)
        self.close_connection = True
        return
You can see where the exception is thrown in the "except socket.timeout as e:" clause.
I have tried overriding this method by including it in my code but it is not clear what is causing the error so I run into dead ends. I've tried creating very basic HTML pages to see if there was something in the page itself, but even "blank" pages cause the same sporadic issue.
What's odd is that sometimes a page loads instantly, and then, almost randomly, it will time out. Sometimes the same page, sometimes a different page.
I've played with the http.timeout setting, but it makes no difference. I suspect it's some underlying socket issue, but am unable to diagnose it further.
This is on a Mac running Big Sur 11.3.1, with Python version 3.9.4.
Any ideas on what might be causing this timeout, and in particular any suggestions on a resolution? Any pointers would be appreciated.
After further investigation, this appears to be an issue specific to Safari. Running the exact same code with Firefox does not show the same problem.
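One possible explanation (an assumption, not a confirmed diagnosis): the stock HTTPServer is single threaded and handles one connection at a time, and Safari is known to open speculative keep-alive connections that may send no data, so an idle connection can block the next request until the socket read times out. A low-cost thing to try is ThreadingHTTPServer (available in http.server since Python 3.7), which serves each connection in its own thread:

from http.server import ThreadingHTTPServer

# MyHandler stands in for the request handler class described above.
# ThreadingHTTPServer gives every connection its own thread, so one
# idle keep-alive connection cannot stall the next request.
server = ThreadingHTTPServer(("127.0.0.1", 50000), MyHandler)
server.serve_forever()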
This question already has an answer here: How to reliably check if a domain has been registered or is available?
I have a large list of domains and I need to check if domains are available now. I do it like this:
import requests

list_domain = ['google.com', 'facebook.com']

for domain in list_domain:
    result = requests.get(f'http://{domain}', timeout=10)
    if result.status_code == 200:
        print(f'Domain {domain} [+++]')
    else:
        print(f'Domain {domain} [---]')
But the check is too slow. Is there a way to make it faster? Maybe someone knows an alternative method for checking domains for existence?
You can use the socket library to determine if a domain has a DNS entry:
>>> import socket
>>>
>>> addr = socket.gethostbyname('google.com')
>>> addr
'74.125.193.100'
>>> socket.gethostbyname('googl42652267e.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
>>>
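If you want to run that check over a whole list, a small loop that catches socket.gaierror does the trick (a minimal sketch; the domain list is just an example):

import socket

domains = ['google.com', 'facebook.com', 'googl42652267e.com']  # example list

for domain in domains:
    try:
        addr = socket.gethostbyname(domain)
        print(f'{domain} resolves to {addr}')
    except socket.gaierror:
        print(f'{domain} has no DNS entry')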
If you want to check which domains are available, the more correct approach would be to catch the ConnectionError from the requests module, because even if you get a response code that is not 200, the fact that there is a response means that there is a server associated with that domain. Hence, the domain is taken.
This is not foolproof for checking domain availability, because a domain might be taken but not have an appropriate A record associated with it, or its server may just be down for the time being.
The code below also runs the checks concurrently, using a thread pool.
from concurrent.futures import ThreadPoolExecutor

import requests
from requests.exceptions import ConnectionError


def validate_existence(domain):
    try:
        response = requests.get(f'http://{domain}', timeout=10)
    except ConnectionError:
        print(f'Domain {domain} [---]')
    else:
        print(f'Domain {domain} [+++]')


list_domain = ['google.com', 'facebook.com', 'nonexistent_domain.test']

with ThreadPoolExecutor() as executor:
    executor.map(validate_existence, list_domain)
You can do that via the requests-futures module. requests-futures runs requests asynchronously; with an average internet connection it can check 8-10 URLs per second (based on my experience).
What you can also do is run the script multiple times, giving each run only a limited number of domains, to speed things up.
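As a rough sketch of the requests-futures approach (the worker count and timeout are example values, not recommendations):

from requests_futures.sessions import FuturesSession

list_domain = ['google.com', 'facebook.com']

session = FuturesSession(max_workers=8)
# fire off all requests first; each call returns a Future immediately
futures = {domain: session.get(f'http://{domain}', timeout=10) for domain in list_domain}

for domain, future in futures.items():
    try:
        future.result()  # raises if the connection failed
        print(f'Domain {domain} [+++]')
    except Exception:
        print(f'Domain {domain} [---]')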
Use Scrapy; it is way faster, and by default it only yields 200 responses unless you override that. In your case, follow these steps:
pip install scrapy
After installing, use a terminal in your project folder to create the project:
scrapy startproject projectname projectdir
It will create a folder named projectdir.
Now
cd projectdir
Inside projectdir, enter:
scrapy genspider mydomain mydomain.com
Now navigate to the spiders folder and open mydomain.py.
Now add a few lines of code:
import scrapy


class MydomainSpider(scrapy.Spider):
    name = "mydomain"

    def start_requests(self):
        # scrapy.Request requires full URLs, including the scheme
        urls = [
            'https://www.facebook.com',
            'https://www.google.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {'Available_Domains': response.url}
Now go back to projectdir and run:
scrapy crawl mydomain -o output.csv
You will have all the working domains that returned status code 200 in the output.csv file.
For more, see the Scrapy documentation.
I have a program that makes a web request. I protect against an invalid request by doing the following:
import time

for i in range(5):
    val = makeRequest()
    if val is None:
        time.sleep(i*5)
    else:
        break
My question: is this a good way to protect against invalid web requests (i.e. connection error, server error, etc.)? In particular, is my choice of an increasing wait in 5-second steps a good one?
Edit: the function makeRequest() returns None upon an exception.
I'm a novice Python programmer trying to use Python to scrape a large amount of pages from fanfiction.net and deposit a particular line of the page's HTML source into a .csv file. My program works fine, but eventually hits a snag where it stops running. My IDE told me that the program has encountered "Errno 10054: an existing connection was forcibly closed by the remote host".
I'm looking for a way to get my code to reconnect and continue every time I get the error. My code will be scraping a few hundred thousand pages every time it runs; is this maybe just too much for the site? The site doesn't appear to prevent scraping. I've done a fair amount of research on this problem already and attempted to implement a retry decorator, but the decorator doesn't seem to work. Here's the relevant section of my code:
import time
import urllib.error
import urllib.request
from functools import wraps


def retry(ExceptionToCheck, tries=4, delay=3, backoff=2, logger=None):
    def deco_retry(f):
        @wraps(f)
        def f_retry(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return f(*args, **kwargs)
                except ExceptionToCheck as e:
                    msg = "%s, Retrying in %d seconds..." % (str(e), mdelay)
                    if logger:
                        logger.warning(msg)
                    else:
                        print(msg)
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            return f(*args, **kwargs)
        return f_retry  # true decorator
    return deco_retry


@retry(urllib.error.URLError, tries=4, delay=3, backoff=2)
def retrieveURL(URL):
    response = urllib.request.urlopen(URL)
    return response


def main():
    # first check: 5000 to 100,000
    MAX_ID = 600000
    ID = 400001
    URL = "http://www.fanfiction.net/s/" + str(ID) + "/index.html"

    fCSV = open('buffyData400k600k.csv', 'w')
    fCSV.write("Rating, Language, Genre 1, Genre 2, Character A, Character B, Character C, Character D, Chapters, Words, Reviews, Favorites, Follows, Updated, Published, Story ID, Story Status, Author ID, Author Name" + '\n')

    while ID <= MAX_ID:
        URL = "http://www.fanfiction.net/s/" + str(ID) + "/index.html"
        response = retrieveURL(URL)
Whenever I run the .py file outside of my IDE, it eventually locks up and stops grabbing new pages after about an hour, tops. I'm also running a different version of the same file in my IDE, and that appears to have been running for almost 12 hours now, if not longer. Is it possible that the file could work in my IDE but not when run independently?
Have I set my decorator up wrong? What else could I potentially do to get Python to reconnect? I've also seen claims that the SQL native client being out of date could cause problems for a Windows user such as myself; is this true? I've tried to update it but had no luck.
Thank you!
You are catching URLError, which Errno 10054 is not, so your @retry decorator is not going to retry. Try this:
@retry(Exception, tries=4)
def retrieveURL(URL):
    response = urllib.request.urlopen(URL)
    return response
This should retry 4 times on any Exception. Your @retry decorator itself is defined correctly.
Your code for reconnecting looks good except for one part: the exception that you're trying to catch. According to this Stack Overflow question, Errno 10054 is a socket.error. All you need to do is import socket and add an except socket.error statement in your retry handler.
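Putting that together, the decorator could be applied like this (a sketch that assumes the retry decorator defined in the question is in scope; in Python 3, socket.error is an alias of OSError, which covers Errno 10054):

import socket
import urllib.request

# socket.error is what Errno 10054 shows up as; tries/delay/backoff are example values
@retry(socket.error, tries=4, delay=3, backoff=2)
def retrieveURL(URL):
    return urllib.request.urlopen(URL)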
I am using Python 2.7 to perform a simple task: launch a browser, verify the header, and close the browser.
import webbrowser
import httplib
import os

# Launch the browser at google
new = 0
url = "http://www.google.com/"
webbrowser.open(url, new=new)

# Check for the header
conn = httplib.HTTPConnection("www.google.com")
conn.request("HEAD", "/")
r1 = conn.getresponse()

# Close the browser
os.system("taskkill /im iexplore.exe")
This just runs in an infinite loop in order to verify continuous connectivity. A ping check isn't sufficient for the amount of traffic I need, or I would use that.
My problem is that if I do lose connectivity, the script freezes and I get addressinfo errors. How do I ignore this, or recognize it, kill the browser and keep the script running?
Sorry, if I'm not doing this right...it is my first post.
I don't think you actually need a browser here at all.
Meanwhile, the way you ignore or recognize errors is with a try statement. So:
import httplib
import time

while True:
    try:
        conn = httplib.HTTPConnection("www.google.com")
        conn.request("HEAD", "/")
        r1 = conn.getresponse()
        if not my_verify_response_func(r1):
            print('Headers are wrong!')
    except Exception as e:
        print('Failed to check headers with {}'.format(e))
    time.sleep(60)  # I doubt you want to run as fast as possible
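If the freeze comes from a request hanging rather than failing outright, it may also help to pass an explicit timeout to HTTPConnection (supported since Python 2.6); the 10 seconds here is just an example:

# same loop as above, but the timeout makes a dead network raise an
# error instead of blocking the script indefinitely
conn = httplib.HTTPConnection("www.google.com", timeout=10)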