The scanner works until it hits an external address that is no longer available, and then it crashes.
I just want to scan herold.at and extract the email addresses.
I want it to stop scanning outside addresses. I tried
r = requests.get('http://github.com', allow_redirects=False) but that does not work.
import csv
import requests
import re
import time
from bs4 import BeautifulSoup

# Number of pages plus one
allLinks = []
mails = []
url = 'https://www.herold.at/gelbe-seiten/wien/was_installateur/?page='

for page in range(3):
    time.sleep(5)
    print('---', page, '---')
    response = requests.get(url + str(page), timeout=1.001)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = [a.attrs.get('href') for a in soup.select('a[href]')]
    for i in links:
        #time.sleep(15)
        if ("Kontakt" in i or "Porträt"):
            allLinks.append(i)

allLinks = set(allLinks)

def findMails(soup):
    #time.sleep(15)
    for name in soup.find_all("a", "ellipsis"):
        if name is not None:
            emailText = name.text
            match = bool(re.match('[a-zA-Z0-9-_.]+#[a-zA-Z0-9-_.]+', emailText))
            if '#' in emailText and match == True:
                emailText = emailText.replace(" ", '').replace('\r', '')
                emailText = emailText.replace('\n', '').replace('\t', '')
                if (len(mails) == 0) or (emailText not in mails):
                    print(emailText)
                    mails.append(emailText)

for link in allLinks:
    if link.startswith("http") or link.startswith("www"):
        r = requests.get(link)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)
    else:
        newurl = url + link
        r = requests.get(newurl)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)

mails = set(mails)
if len(mails) == 0:
    print("NO MAILS FOUND")
Error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.gebrueder-lamberger.at', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000021A24AA7308>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
The error is in this line: if(link.startswith("http") or link.startswith("www")):. Change "http" to "https" and it should work. I tried it and it fetched all the emails.
--- 0 ---
--- 1 ---
--- 2 ---
office#smutny-installationen.at
office#offnerwien.at
office#remes-gmbh.at
wien13#lugar.at
office#rossbacher-at.com
office#weiner-gmbh.at
office#wojtek-installateur.at
office#b-gas.at
office#blasl-gmbh.at
gsht#aon.at
office#ertl-installationen.at
office#jakubek.co.at
office#peham-installateur.at
office#installateur-weber.co.at
office#gebrueder-lamberger.at
office#ar-allround-installationen.at
Also, you can try urllib3 to set up your own connection pool for the requests.
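If you only want links hosted on herold.at and want the script to survive dead hosts, a minimal sketch of the final loop could look like this (the domain check and the 10-second timeout are my assumptions, not part of the original code):

import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

for link in allLinks:
    full = link if link.startswith("http") else url + link
    # Only follow links hosted on herold.at (assumption: everything else
    # counts as an "outside address" and should be skipped).
    if "herold.at" not in urlparse(full).netloc:
        continue
    try:
        r = requests.get(full, timeout=10)
    except requests.exceptions.RequestException:
        # Dead or unreachable host: skip it instead of crashing.
        print("skipping unreachable link:", full)
        continue
    findMails(BeautifulSoup(r.text, 'html.parser'))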
Related
Can anybody help me with this small dilemma? I want to stop the Python program if the IP address 10.10.10.2 is reachable within 10 seconds. If it is not reachable within 10 seconds, it should handle the exception and continue with the program. If 10.10.10.2 is reachable, it should print "This IP address is reachable you are using the wrong device please disconnect". I thought about putting a `sys.exit(1)` after the except, but I'm constantly getting errors. I am very new to Python, or any programming language for that matter, so any example code snippets and help are much appreciated.
import pandas as pd
from xml.dom import minidom
import urllib.request
import time
from urllib.error import HTTPError

print(100*"#")
try:
    preflash = urllib.request.urlopen("http://10.10.10.2", timeout=10).getcode()
    print("Web page status code:", preflash)
    print("IP address: 10.10.10.2 is reachable")
except urllib.error.URLError:
    correct = urllib.request.urlopen("http://192.168.100.5", timeout=10).getcode()
    print("Web page status code:", correct)
    print("IP address: 192.168.100.5 is reachable")
print(100*"#")

# Declare url String
url_str = 'http://192.168.100.2/globals.xml'
# open webpage and read values
xml_str = urllib.request.urlopen(url_str).read()
# Parses XML doc to String for Terminal output
xmldoc = minidom.parseString(xml_str)
# Finding the necessary set points / Sollwerte from the xmldoc
time.sleep(0.5)
# prints the order_number from the xmldoc
order_number = xmldoc.getElementsByTagName('order_number')
print("The Order number of the current device is:", order_number[0].firstChild.nodeValue)
print(100*"-")
The output of the Python program looks like this:
Web page status code: 200
IP address: 10.10.10.2 is reachable the programm will shut down in 5 seconds
####################################################################################################
The Order number of the current device is: 58184
----------------------------------------------------------------------------------------------------
The program needs to shut down if 10.10.10.2 is reachable.
Quite stupid of me: all I needed to do was add a sys.exit(1) call before the except.
import sys

try:
    preflash = urllib.request.urlopen("http://10.10.10.2", timeout=10).getcode()
    print("Web page status code:", preflash)
    print("IP address: 10.10.10.2 is reachable")
    sys.exit(1)
except urllib.error.URLError:
    correct = urllib.request.urlopen("http://192.168.100.5", timeout=10).getcode()
    print("Web page status code:", correct)
    print("IP address: 192.168.100.5 is reachable")
I want to download photos from an Iranian website. When I run the code in Colab, I get a timeout error and a URLError.
from bs4 import BeautifulSoup
import urllib.request

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    #req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    #thepage = urlopen(req).read()
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

i = 1
soup = make_soup("https://www.banikhodro.com/car/pride/")
for img in soup.find_all('img'):
    temp = img.get('src')
    #print(temp)
    if temp[0] == "/":
        image = "https://www.banikhodro.com/car/pride/" + temp
    else:
        image = temp
    #print(image)
    nametemp = img.get('alt')
    nametemp = str(nametemp)
    if len(nametemp) == 0:
        i = i + 1
    else:
        filename = nametemp
        imagefile = open(filename + ".jpeg", 'wb')
        imagefile.write(urllib.request.urlopen(image).read())
        imagefile.close()
TimeoutError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/urllib3/connection.py in _new_conn(self)
158 conn = connection.create_connection(
--> 159 (self._dns_host, self.port), self.timeout, **extra_kw)
160
15 frames
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
NewConnectionError Traceback (most recent call last)
NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f079e4cdcf8>: Failed to establish a new connection: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
MaxRetryError Traceback (most recent call last)
MaxRetryError: HTTPSConnectionPool(host='www.banikhodro.com', port=443): Max retries exceeded with url: /car/pride/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f079e4cdcf8>: Failed to establish a new connection: [Errno 110] Connection timed out',))
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
514 raise SSLError(e, request=request)
515
--> 516 raise ConnectionError(e, request=request)
517
518 except ClosedPoolError as e:
ConnectionError: HTTPSConnectionPool(host='www.banikhodro.com', port=443): Max retries exceeded with url: /car/pride/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f079e4cdcf8>: Failed to establish a new connection: [Errno 110] Connection timed out',))
I get these timeout and connection errors in Google Colab when I use this Iranian website to download the images.
Thanks in advance to those who answer my questions
One way of doing this would be:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.banikhodro.com/car/pride/").content
soup = BeautifulSoup(page, "html5lib").find_all("span", {"class": "photo"})
images = [
    f"https://www.banikhodro.com{img.find('img')['src']}" for img in soup
    if "Adv" in img.find("img")["src"]
]

for image in images:
    print(f"Fetching {image}")
    with open(image.rsplit("/")[-1], "wb") as img:
        img.write(requests.get(image).content)
This fetches all non-generic images of car offers to your local folder.
183093_1-m.jpg
183098_1-m.jpg
183194_1-m.jpg
183208_1-m.jpg
183209_1-m.jpg
183272_1-m.jpg
183279_1-m.jpg
183286_1-m.jpg
183384_1-m.jpg
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.banikhodro.com/car/pride/").content
soup = BeautifulSoup(page, "html5lib")
images = [
    f"https://www.banikhodro.com{img['src']}" for img in soup.find_all('img')
    # filter accordingly based on class or id inside the find_all method
]

for image in images:
    print(f"Fetching {image}")
    with open(image.split("/")[-1], "wb") as img:
        img.write(requests.get(image).content)
pip install requests  # install the requests module if it is not already available
This code will fetch all kinds of images, including footer graphics and other page furniture.
You can filter the image data in the find_all method, which has a parameter called attrs; see the BeautifulSoup documentation for more details.
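For example, to mirror the "Adv" filter from the first snippet using attrs (the regex is my assumption about what distinguishes the advert photos):

import re

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.banikhodro.com/car/pride/").content
soup = BeautifulSoup(page, "html5lib")

# attrs lets you match on any attribute; here we keep only <img> tags whose
# src contains "Adv", like the first answer did with a list comprehension.
ad_images = soup.find_all("img", attrs={"src": re.compile("Adv")})
for img in ad_images:
    print(img["src"])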
I attempted to create a subdomain brute forcer in Python, but my code doesn't work. There's probably a better way to do it; I just need to be pointed in the right direction.
import sys
import socket
import requests

host = "paypal.com"
sublist = ["cpanel.", "admin.", "manager.", "secure."]

try:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    status = s.connect_ex((host, 80))
    if status == 0:
        print (host + " is up!")
    else:
        print (host + " is down!")
    s.close()
except socket.error:
    print (host + " is not reachable")

def checklist():
    try:
        for lines in sublist:
            check = requests.get("http://" + lines + host).status_code
            if check == 200:
                print "Found: " + lines + host
    except Exception:
        print "Error"

checklist()
It just prints out "Error" in the terminal; I don't know if it's checking the subdomains against the host.
How can I loop through the list, check every subdomain of the site to see if it's available, and then display it in the terminal?
The error without the except clause:
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cpanel.paypal.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known',))
The problem is that you're sending requests too fast; you can also try using headers to disguise your Python request.
import time

try:
    time.sleep(1.5)
    check = requests.get("http://" + lines + host).status_code
except requests.exceptions.ConnectionError:
    # mark this subdomain as unreachable and move on
    check = "Connection Refused by Host"
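More generally, to answer the "how do I loop through the list" part: catch the connection error per subdomain so the loop keeps going instead of aborting on the first name that does not resolve. A sketch (the 5-second timeout is my choice, not from the original code):

import requests

host = "paypal.com"
sublist = ["cpanel.", "admin.", "manager.", "secure."]

def checklist():
    for sub in sublist:
        url = "http://" + sub + host
        try:
            check = requests.get(url, timeout=5).status_code
        except requests.exceptions.RequestException:
            # Subdomain does not resolve or refuses connections: skip it.
            continue
        if check == 200:
            print("Found: " + sub + host)

checklist()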
I'd like to use the Requests package to connect to the streaming API of a web service. Suppose I use the following code to send a request, receive the response and iterate through the lines of response as they arrive:
import requests

r = requests.get('http://httpbin.org/stream/20', stream=True)

for line in r.iter_lines():
    if line:
        print line
While waiting to receive new data, we are basically waiting for r.iter_lines() to generate a new piece of data. But what if I lose internet connection while waiting? How can I find out so I can attempt to reconnect?
You can disconnect from your network to try it out. Requests raises an error like this:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='httpbin.org', port=80): Max retries exceeded with url: /stream/20 (Caused by : [Errno -3] Temporary failure in name resolution)
The error message shows that Requests already retries on network errors. You can refer to this answer for setting max_retries. If you want more customization (e.g. waiting between retries), do it in a loop:
import socket
import requests
import time

MAX_RETRIES = 2
WAIT_SECONDS = 5

for i in range(MAX_RETRIES):
    try:
        r = requests.get('http://releases.ubuntu.com/14.04.1/ubuntu-14.04.1-desktop-amd64.iso',
                         stream=True, timeout=10)
        idx = 1
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                print 'Chunk %d received' % idx
                idx += 1
        break
    except requests.exceptions.ConnectionError:
        print 'build http connection failed'
    except socket.timeout:
        print 'download failed'
    time.sleep(WAIT_SECONDS)
else:
    print 'all tries failed'
EDIT: I tested with a large file. I used iter_content instead, because it's a binary file; iter_lines is based on iter_content (see the source code), so I believe the behaviour is the same. Procedure: run the code with the network connected; after receiving some chunks, disconnect; wait 2-3 seconds, reconnect, and the download continues. So the requests package DOES retry when the connection is lost during the iteration.
Note: if there is no network when the connection is built (requests.get()), ConnectionError is raised; if the network is lost during iter_lines / iter_content, socket.timeout is raised.
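For the max_retries setting mentioned above, a minimal sketch using a Session and HTTPAdapter might look like this (the retry count and backoff are illustrative, not taken from the answer):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed connection attempts a few times with exponential backoff.
retries = Retry(total=3, backoff_factor=2)
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

r = session.get('http://httpbin.org/stream/20', stream=True, timeout=10)
for line in r.iter_lines():
    if line:
        print(line)

Note that max_retries only applies while the connection is being established (DNS lookups, socket connects, connection timeouts), not to data lost mid-stream, so a loop like the one above is still needed to resume an interrupted iteration.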
I am trying to connect to URL https://www.ssehl.co.uk/HALO/publicLogon.do in Python.
The simple solution using requests fails:
import requests
r = requests.get('https://www.ssehl.co.uk/HALO/publicLogon.do')
print r.text
with error
File "c:\Python27\lib\site-packages\requests\adapters.py", line 327, in send
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.ssehl.co.uk', port=443): Max retries exceeded with url: /HALO/publicLogon.do (Caused by <class 'httplib.BadStatusLine'>: '')
so I tried to get the raw response from the server using the socket library:
import socket  # for sockets
import sys     # for exit

# create an INET, STREAMing socket
try:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
except socket.error:
    print 'Failed to create socket'
    sys.exit()

print 'Socket Created'

host = 'www.ssehl.co.uk'
port = 443

try:
    remote_ip = socket.gethostbyname(host)
except socket.gaierror:
    # could not resolve
    print 'Hostname could not be resolved. Exiting'
    sys.exit()

# Connect to remote server
s.connect((remote_ip, port))
print 'Socket Connected to ' + host + ' on ip ' + remote_ip

# Send some data to remote server
message = "GET /HALO/publicLogon.do HTTP/1.1\r\n\r\n"
try:
    # Set the whole string
    s.sendall(message)
except socket.error:
    # Send failed
    print 'Send failed'
    sys.exit()

print 'Message send successfully'

# Now receive data
reply = s.recv(4096)
print reply
will output:
Socket Created
Socket Connected to www.ssehl.co.uk on ip 161.12.7.194
Message send successfully
Reply:
After the reply there is some garbage which I can't paste; here is a Sublime console screenshot:
Screenshot
Is there any way to get a 200 response from the server, just like a browser?
For some reason, when you use either Python's built-in tools (urllib2, requests, httplib) or even command line tools (curl, wget) over https, the server spazzes out and gives an erroneous response.
However, when you request the page over regular http, it works fine. For example:
import urllib2
print urllib2.urlopen('http://www.ssehl.co.uk/HALO/publicLogon.do').getcode()
prints out
>> 200
My guess is that their servers are configured wrong and your browser somehow deals with it silently.
It worked for me when I used port 80. Sooo:
port = 80
There must be some problem when talking to HTTPS servers this way through Python...
Also, you are sending the wrong request: you are not sending the Host header. Fixed request:
message = "GET /HALO/publicLogon.do HTTP/1.1\r\nHost: %s\r\n\r\n" % host
So, with those two changes, the code works.
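A sketch of the adjusted snippet (Python 3 syntax, port 80, Host header added; this is my reconstruction and is not verified against this server):

import socket

host = 'www.ssehl.co.uk'
port = 80  # plain HTTP instead of 443

# Minimal HTTP/1.1 request that includes the Host header.
request = ("GET /HALO/publicLogon.do HTTP/1.1\r\n"
           "Host: %s\r\n"
           "Connection: close\r\n\r\n" % host)

with socket.create_connection((host, port)) as s:
    s.sendall(request.encode())
    reply = s.recv(4096)
    print(reply.decode(errors="replace"))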
I think the problem exists because port 443 expects an encrypted (TLS) connection, and a plain socket does not provide that.
You should use ssl.wrap_socket if you want to support https over a raw socket.
See http://docs.python.org/2/library/ssl.html for details.
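A minimal sketch of that approach using the modern ssl context API (the current replacement for ssl.wrap_socket); the request details are my assumptions and it is untested against this particular server:

import socket
import ssl

host = 'www.ssehl.co.uk'

context = ssl.create_default_context()
request = ("GET /HALO/publicLogon.do HTTP/1.1\r\n"
           "Host: %s\r\n"
           "Connection: close\r\n\r\n" % host)

# Open a TCP connection, then wrap it in TLS before speaking HTTP.
with socket.create_connection((host, 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
        tls_sock.sendall(request.encode())
        print(tls_sock.recv(4096).decode(errors="replace"))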