Python Requests package: lost connection while streaming

I'd like to use the Requests package to connect to the streaming API of a web service. Suppose I use the following code to send a request, receive the response, and iterate through the lines of the response as they arrive:
import requests

r = requests.get('http://httpbin.org/stream/20', stream=True)
for line in r.iter_lines():
    if line:
        print line
While waiting to receive new data, we are essentially waiting for r.iter_lines() to yield the next piece of data. But what if I lose my internet connection while waiting? How can I find out, so that I can attempt to reconnect?

You can disconnect from your network to try it out. Requests raises an error like this:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='httpbin.org', port=80): Max retries exceeded with url: /stream/20 (Caused by : [Errno -3] Temporary failure in name resolution)
The error message shows that Requests already retries on network errors. You can refer to this answer for setting max_retries. If you want more customization (e.g. waiting between retries), do it in a loop:
import socket
import requests
import time

MAX_RETRIES = 2
WAIT_SECONDS = 5

for i in range(MAX_RETRIES):
    try:
        r = requests.get('http://releases.ubuntu.com/14.04.1/ubuntu-14.04.1-desktop-amd64.iso',
                         stream=True, timeout=10)
        idx = 1
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                print 'Chunk %d received' % idx
                idx += 1
        break  # finished downloading, stop retrying
    except requests.exceptions.ConnectionError:
        print 'build http connection failed'
    except socket.timeout:
        print 'download failed'
    time.sleep(WAIT_SECONDS)
else:
    print 'all tries failed'
EDIT: I tested with a large file, using iter_content instead because it's a binary file. iter_lines is based on iter_content (source code), so I believe the behaviour is the same. Procedure: run the code with the network connected; after receiving some chunks, disconnect; wait 2-3 seconds, then reconnect, and the download continues. So the requests package DOES retry when the connection is lost during the iteration.
Note: if there is no network when the connection is built (requests.get()), ConnectionError is raised; if the network is lost during iter_lines / iter_content, socket.timeout is raised.
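For reference, here is a minimal sketch of the max_retries approach mentioned above, configured through an HTTPAdapter mounted on a Session (the retry count and the streaming URL are placeholders, not from the original answer; max_retries only covers failed connection attempts, not errors during iteration):
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# retry failed connection attempts up to 3 times before raising ConnectionError
adapter = HTTPAdapter(max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)

r = session.get('http://httpbin.org/stream/20', stream=True, timeout=10)
for line in r.iter_lines():
    if line:
        print(line)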

Related

Connection aborted, RemoteDisconnected('Remote end closed connection without response') after 300k requests

I am trying to get data through an API.
To achieve that, I created a multiprocessing pool to parallelize my calls and be more efficient. I run the pool with 6 processes (my limit, so I don't exceed the API's 5 calls/second limit).
The first 300k calls run fine, but after a dozen hours of running I always get the same error:
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
The error is not caught by this piece of code, and therefore it makes my run crash:
headers = {"Authorization": "Bearer {bearer_token}".format(bearer_token=bearer_token)}
try:
r = requests.get(url=GET_STUFF, headers=headers)
except (requests.exceptions.ConnectionError, ConnectionResetError):
return {}
I'm not sure I understand why such an error happens here, sometimes after 15 hours of running fine. Is it caused by my internet connection, or am I getting banned?
Is it safe to "just catch" the exception and treat it like an empty response? (Empty responses are properly handled in my code, so that's not a problem.)
PS: here is the code I use to run the multiprocessing pool:
if __name__ == "__main__":
try:
with multiprocessing.Pool(processes=6) as pool:
results = pool.starmap(get_job_offers, parameters)
except Exception as e:
telegram_bot_sendtext('Error Raised')
logging.error(e)
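One common approach (a sketch, not the poster's code; get_with_retry, max_retries and wait_seconds are made-up names) is to catch requests.exceptions.RequestException, the base class of every exception requests raises, and retry a few times before giving up:
import time
import requests

def get_with_retry(url, headers, max_retries=3, wait_seconds=2):
    # retry a GET a few times before treating it as an empty response
    for attempt in range(max_retries):
        try:
            return requests.get(url=url, headers=headers, timeout=30)
        except requests.exceptions.RequestException as e:
            # also covers the ConnectionError that wraps RemoteDisconnected
            print('attempt %d failed: %s' % (attempt + 1, e))
            time.sleep(wait_seconds)
    return None  # the caller can treat None like an empty response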

Python Max retries exceeded with url

The scanner works until it finds an external address that is no longer available, and then it crashes.
I just want to scan only herold.at and extract the email addresses.
I want it to stop scanning outside addresses. I tried
r = requests.get('http://github.com', allow_redirects=False)
but that does not work.
import csv
import requests
import re
import time
from bs4 import BeautifulSoup

# Number of pages plus one
allLinks = []
mails = []
url = 'https://www.herold.at/gelbe-seiten/wien/was_installateur/?page='
for page in range(3):
    time.sleep(5)
    print('---', page, '---')
    response = requests.get(url + str(page), timeout=1.001)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = [a.attrs.get('href') for a in soup.select('a[href]')]
    for i in links:
        if "Kontakt" in i or "Porträt" in i:
            allLinks.append(i)
allLinks = set(allLinks)

def findMails(soup):
    for name in soup.find_all("a", "ellipsis"):
        if name is not None:
            emailText = name.text
            match = bool(re.match('[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+', emailText))
            if '@' in emailText and match == True:
                emailText = emailText.replace(" ", '').replace('\r', '')
                emailText = emailText.replace('\n', '').replace('\t', '')
                if (len(mails) == 0) or (emailText not in mails):
                    print(emailText)
                    mails.append(emailText)

for link in allLinks:
    if link.startswith("http") or link.startswith("www"):
        r = requests.get(link)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)
    else:
        newurl = url + link
        r = requests.get(newurl)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)

mails = set(mails)
if len(mails) == 0:
    print("NO MAILS FOUND")
Error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.gebrueder-lamberger.at', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000021A24AA7308>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
The error is in the line if(link.startswith("http") or link.startswith("www")):. Change the http into https and it should work. I tried it and it fetched all the emails:
--- 0 ---
--- 1 ---
--- 2 ---
office@smutny-installationen.at
office@offnerwien.at
office@remes-gmbh.at
wien13@lugar.at
office@rossbacher-at.com
office@weiner-gmbh.at
office@wojtek-installateur.at
office@b-gas.at
office@blasl-gmbh.at
gsht@aon.at
office@ertl-installationen.at
office@jakubek.co.at
office@peham-installateur.at
office@installateur-weber.co.at
office@gebrueder-lamberger.at
office@ar-allround-installationen.at
Also, you can try urllib3 to set up your connection pool.
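If you go the urllib3 route, here is a minimal sketch (reusing the allLinks, findMails and BeautifulSoup names from the question's code; the retry numbers are arbitrary) that retries transient failures and skips hosts that never respond instead of crashing:
import urllib3
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

# one pool manager reused for every request, with a conservative retry policy
http = urllib3.PoolManager(retries=Retry(connect=3, read=2, backoff_factor=0.5))

for link in allLinks:
    try:
        response = http.request('GET', link, timeout=5.0)
        findMails(BeautifulSoup(response.data, 'html.parser'))
    except urllib3.exceptions.MaxRetryError:
        # unreachable external host: skip it instead of crashing
        print('skipping unreachable link: ' + link)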

Sending data multiple times over a single socket in python

I have a Python program where I use a server socket to send data. There is a class which has some threading methods. Each method checks a queue, and if the queue is not empty, it sends the data over the server socket. The queues are filled with what clients send to the server (the server is listening for input requests). Sending is accomplished with a method call:
def send(self, data):
    self.sqn += 1
    try:
        self.clisock.send(data)
    except Exception, e:
        print 'Send packet failed with error: ' + e.message
When the program starts, the sending rate is around 500, but after a while it drops suddenly to 30 with this exception:
Send packet failed with error: <class 'socket.error'>>>[Errno 32] Broken pipe
I don't know what causes the rate to drop! Any idea?
That error is from your send function trying to write to a socket that was closed on the other side. If that is intended, then catch the exception using
import errno, socket
try:
    self.clisock.send(data)
except socket.error, err:
    if err[0] == errno.EPIPE:
        pass  # do something
    else:
        pass  # do something else
If this isn't intended behavior on the part of the client then you'll have to update your post with the corresponding client code.
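As an illustration only (the connected flag below is a hypothetical addition, not from the post), one way to handle the broken pipe is to close the dead socket and stop queuing data for that client:
import errno
import socket

def send(self, data):
    self.sqn += 1
    try:
        self.clisock.send(data)
    except socket.error as err:
        if err.errno == errno.EPIPE:
            # the client closed its end of the connection:
            # close our side and mark the client as gone
            self.clisock.close()
            self.connected = False  # hypothetical flag the sender threads check
        else:
            raise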

Limiting number of processes in multiprocessing python

My requirement is to generate hundreds of HTTP POST requests per second. I am doing it using urllib2.
def send():
    req = urllib2.Request(url)
    req.add_data(data)
    response = urllib2.urlopen(req)

while datetime.datetime.now() <= ftime:
    p = Process(target=send, args=[])
    p.start()
    time.sleep(0.001)
The problem is that this code sometimes throws either of the following exceptions for some iterations:
HTTP 503 Service Unavailable.
URLError: <urlopen error [Errno -2] Name or service not known>
I have tried using requests (HTTP for Humans) as well, but I am having some proxy issues with that module. It seems requests sends HTTP packets to the proxy server even when the target machine is within the same LAN. I don't want the packets to go to the proxy server.
The simplest way to limit the number of concurrent connections is to use a thread pool:
#!/usr/bin/env python
from itertools import izip, repeat
from multiprocessing.dummy import Pool  # use threads for I/O-bound tasks
from urllib2 import urlopen

def fetch(url_data):
    try:
        return url_data[0], urlopen(*url_data).read(), None
    except EnvironmentError as e:
        return url_data[0], None, str(e)

if __name__ == "__main__":
    pool = Pool(20)  # use 20 concurrent connections
    params = izip(urls, repeat(data))  # use the same data for all urls
    for url, content, error in pool.imap_unordered(fetch, params):
        if error is None:
            print("done: %s: %d" % (url, len(content)))
        else:
            print("error: %s: %s" % (url, error))
503 Service Unavailable is a server error; the server might be failing to handle the load.
Name or service not known is a DNS error. If you need to make many requests, install/enable a local caching DNS server.
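On the proxy issue mentioned in the question: a sketch (the LAN address and payload are placeholders) of asking requests not to use the environment's proxy settings:
import requests

session = requests.Session()
session.trust_env = False  # ignore HTTP_PROXY / HTTPS_PROXY / NO_PROXY environment variables
r = session.post('http://192.168.1.10/endpoint', data={'key': 'value'})

# alternatively, keep the environment but list the LAN hosts in NO_PROXY,
# e.g. export NO_PROXY="192.168.1.10" before running the script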

Checking a Python FTP connection

I have an FTP connection from which I am downloading many files and processing them in between. I'd like to be able to check that my FTP connection hasn't timed out in between. So the code looks something like:
conn = FTP(host='blah')
conn.connect()

for item in list_of_items:
    myfile = open('filename', 'w')
    conn.retrbinary('stuff', myfile)
    ### do some parsing ###

How can I check my FTP connection in case it timed out during the ### do some parsing ### line?
Send a NOOP command. This does nothing but check that the connection is still going and if you do it periodically it can keep the connection alive.
For example:
conn.voidcmd("NOOP")
If there is a problem with the connection then the FTP object will throw an exception. You can see from the documentation that exceptions are thrown if there is an error:
socket.error and IOError: These are raised by the socket connection and are most likely the ones you are interested in.
exception ftplib.error_reply: Exception raised when an unexpected reply is received from the server.
exception ftplib.error_temp: Exception raised when an error code signifying a temporary error (response codes in the range 400–499) is received.
exception ftplib.error_perm: Exception raised when an error code signifying a permanent error (response codes in the range 500–599) is received.
exception ftplib.error_proto: Exception raised when a reply is received from the server that does not fit the response specifications of the File Transfer Protocol, i.e. begin with a digit in the range 1–5.
Therefore you can use a try/except block to detect the error and handle it accordingly.
For example this sample of code will catch an IOError, tell you about it and then retry the operation:
retry = True
while retry:
    try:
        conn = FTP('blah')
        conn.connect()
        for item in list_of_items:
            myfile = open('filename', 'w')
            conn.retrbinary('stuff', myfile)
            ### do some parsing ###
        retry = False
    except IOError as e:
        print "I/O error({0}): {1}".format(e.errno, e.strerror)
        print "Retrying..."
        retry = True
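Building on the NOOP idea, here is a sketch (the host name, file names and RETR commands are placeholders, not from the original answer) that checks the control connection before each download and reconnects if the check fails:
from ftplib import FTP
import socket

def ensure_alive(conn, host='blah'):
    # send NOOP; if the control connection is dead, open a new one
    try:
        conn.voidcmd("NOOP")
        return conn
    except (IOError, socket.error, EOFError):
        new_conn = FTP(host)
        new_conn.login()  # credentials omitted: placeholder
        return new_conn

list_of_items = ['a.bin', 'b.bin']  # placeholder file names
conn = FTP('blah')
conn.login()
for item in list_of_items:
    conn = ensure_alive(conn)
    with open(item, 'wb') as myfile:
        conn.retrbinary('RETR ' + item, myfile.write)
    ### do some parsing ###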
