Python 3 urllib: 530 too many connections, in loop - python

I am retrieving data files from a FTP server in a loop with the following code:
response = urllib.request.urlopen(url)
data = response.read()
response.close()
compressed_file = io.BytesIO(data)
gin = gzip.GzipFile(fileobj=compressed_file)
Retrieving and processing the first few works fine, but after a few request I am getting the following error:
530 Maximum number of connections exceeded.
I tried closing the connection (see code above) and using a sleep() timer, but this both did not work. What is it I am doing wrong here?

Trying to make urllib do FTP properly makes my brain hurt. By default, it creates a new connection for each file, apparently without really properly ensuring the connections close.
ftplib is more appropriate I think.
Since I happen to be working on the same data you are(were)... Here is a very specific answer decompressing the .gz files and passing them into ish_parser (https://github.com/haydenth/ish_parser).
I think it is also clear enough to serve as a general answer.
import ftplib
import io
import gzip
import ish_parser # from: https://github.com/haydenth/ish_parser
ftp_host = "ftp.ncdc.noaa.gov"
parser = ish_parser.ish_parser()
# identifies what data to get
USAF_ID = '722950'
WBAN_ID = '23174'
YEARS = range(1975, 1980)
with ftplib.FTP(host=ftp_host) as ftpconn:
ftpconn.login()
for year in YEARS:
ftp_file = "pub/data/noaa/{YEAR}/{USAF}-{WBAN}-{YEAR}.gz".format(USAF=USAF_ID, WBAN=WBAN_ID, YEAR=year)
print(ftp_file)
# read the whole file and save it to a BytesIO (stream)
response = io.BytesIO()
try:
ftpconn.retrbinary('RETR '+ftp_file, response.write)
except ftplib.error_perm as err:
if str(err).startswith('550 '):
print('ERROR:', err)
else:
raise
# decompress and parse each line
response.seek(0) # jump back to the beginning of the stream
with gzip.open(response, mode='rb') as gzstream:
for line in gzstream:
parser.loads(line.decode('latin-1'))
This does read the whole file into memory, which could probably be avoided using some clever wrappers and/or yield or something... but works fine for a year's worth of hourly weather observations.

Probably a pretty nasty workaround, but this worked for me. I made a script (here called test.py) which does the request (see code above). The code below is used in the loop I mentioned and calls test.py
from subprocess import call
with open('log.txt', 'a') as f:
call(['python', 'test.py', args[0], args[1]], stdout=f)

Related

Transform a multiple line url request into a function in Python

I try to download a serie of text files from different websites. I am using urllib.request with Python. I want to expend the list of URL without making the code long.
The working sequence is
import urllib.request
url01 = 'https://web.site.com/this.txt'
url02 = 'https://web.site.com/kind.txt'
url03 = 'https://web.site.com/of.txt'
url04 = 'https://web.site.com/link.txt'
[...]
urllib.request.urlretrieve(url01, "Liste n°01.txt")
urllib.request.urlretrieve(url02, "Liste n°02.txt")
urllib.request.urlretrieve(url03, "Liste n°03.txt")
[...]
The number of file to download is increasing and I want to keep the second part of the code short.
I tried
i = 0
while i<51
i = i +1
urllib.request.urlretrieve( i , "Liste n°0+"i"+.txt")
It doesn't work and I am thinking that a while loop can be use for string but not for request.
So I was thinking of making it a function.
def newfunction(i)
return urllib.request.urlretrieve(url"i", "Liste n°0"+1+".txt")
But it seem that I am missing a big chunk of it.
This request is working but it seem I cannot transform it for long list or URL.
As a general suggestion, I'd recommend the requests module for Python, rather than urllib.
Based on that, some naive code for a possible function:
import requests
def get_file(site, filename):
target = site + "/" + filename
try:
r = requests.get(target, allow_redirects=True)
open(filename, 'wb').write(r.content)
return r.status_code
except requests.exceptions.RequestException as e:
print("File not downloaded, error: {}".format(e))
You can then call the function, passing in parameters of site and file name:
get_file('https://web.site.com', 'this.txt')
The function will raise an exception, but not stop execution, if it cannot download a file. You could expand exception handling to deal with files not being writable, but this should be a start.
It seems as if you're not casting the variable i to an integer before your concatenating it to the url string. That may be the reason why you're code isn't working. The while-loop/for-loop approach shouldn't effect whether or not the requests get sent out. I recommend using the requests module for making requests as well. Mike's post covers what the function should kind of look like. I also recommend creating a sessions object if you're going to be making a whole lot of requests in a piece of code. The sessions object will keep the underlying TCP connection open while you make your requests, which should reduce latency, CPU usage, and network congestion (https://en.wikipedia.org/wiki/HTTP_persistent_connection#Advantages). The code would look something like this:
import requests
with requests.Session() as s:
for i in range(10):
s.get(str(i)+'.com') # make request
# write to file here
To cast to a string you would want something like this:
i = 0
while i<51
i = i +1
urllib.request.urlretrieve( i , "Liste n°0" + str(i) + ".txt")

NFQueue/Scapy Man in the Middle

I'm trying to construct a man in the middle attack on a webpage (i.e. HTTP traffic). I'm doing this by using a Linux machine attached to Ethernet and a client attached to the Linux box via its WiFi hotspot.
What I've done so far is use NFQueue from within the IPTables Linux firewall to route all TCP packets on the FORWARD chain to the NFQueue queue, which a Python script is picking up and then processing those rules. I'm able to read the data off of the HTTP response packets, but whenever I try to modify them and pass them back (accept the packets), I'm getting an error regarding the strings:
Exception AttributeError: "'str' object has no attribute 'build_padding'" in 'netfilterqueue.global_callback' ignored
My code is here, which includes things that I've tried that didn't work. Notably, I'm using a third-party extension for scapy called scapy_http that may be interfering with things, and I'm using a webpage that is not being compressed by gzip because that was messing with things as well. The test webpage that I'm using is here.
#scapy
from scapy.all import *
#nfqueue import
from netfilterqueue import NetfilterQueue
#scapy http extension, not really needed
import scapy_http.http
#failed gzip decoding, also tried some other stuff
#import gzip
def print_and_accept(packet):
#convert nfqueue datatype to scapy-compatible
pkt = IP(packet.get_payload())
#is this an HTTP response?
if pkt[TCP].sport == 80:
#legacy trial that doesn't work
#data = packet.get_data()
print('HTTP Packet Found')
#check what's in the payload
stringLoad = str(pkt[TCP].payload)
#deleted because printing stuff out clogs output
#print(stringLoad)
#we only want to modify a specific packet:
if "<title>Acids and Bases: Use of the pKa Table</title>" in stringLoad:
print('Target Found')
#strings kind of don't work, I think this is a me problem
#stringLoad.replace('>Acids and Bases: Use of the pK<sub>a</sub>', 'This page has been modified: a random ')
#pkt[TCP].payload = stringLoad
#https://stackoverflow.com/questions/27293924/change-tcp-payload-with-nfqueue-scapy
payload_before = len(pkt[TCP].payload)
# I suspect this line is a problem: the string assigns,
# but maybe under the hood scapy doesn't like that very much
pkt[TCP].payload = str(pkt[TCP].payload).replace("Discussion", "This page has been modified")
#recalculate length
payload_after = len(pkt[TCP].payload)
payload_dif = payload_after - payload_before
pkt[IP].len = pkt[IP].len + payload_dif
#recalculate checksum
del pkt[TCP].chksum
del pkt[IP].chksum
del pkt.chksum
print('Packet Modified')
#redudant
#print(stringLoad)
#this throws an error (I think)
print(str(pkt[TCP].payload))
#no clue if this works or not yet
#goal here is to reassign modified packet to original parameter
packet.set_payload(str(pkt))
#this was also throwing the error, so tried to move away from it
#print(pkt.show2())
#bunch of legacy code that didn't work
#print(GET_print(pkt))
#print(pkt.show())
#decompressed_data = zlib.decompress(str(pkt[TCP].payload), 16 + zlib.MAX_WBITS)
#print(decompressed_data)
#print(str(gzip.decompress(pkt[TCP].payload)))
# print(pkt.getlayer(Raw).load)
#print('HTTP Contents Shown')
packet.accept()
def GET_print(packet1):
ret = "***************************************GET PACKET****************************************************\n"
ret += "\n".join(packet1.sprintf("{Raw:%Raw.load%}\n").split(r"\r\n"))
ret += "*****************************************************************************************************\n"
return ret
print('Test: Modify a very specific target')
print('Program Starting')
nfqueue = NetfilterQueue()
nfqueue.bind(1, print_and_accept)
try:
print('Packet Interface Starting')
nfqueue.run()
except KeyboardInterrupt:
print('\nProgram Ending')
nfqueue.unbind()
Apologies in advance if this is hard to read or badly formatted code; Python isn't a language that I write in often. Any help is greatly appreciated!

Django - Heroku - Serving dynamic large file timing out

I have one function view that creates a report using xlsxwriter, it is created on the fly using a StringIO as buffer and finally sending through HttpResponse. It works well using Local Server.
The problem is that on Heroku, after some seconds (documentation mention 30 seconds timeout and not modifiable) the server hangs out and reboot the web process, giving error as a response.
What is the best way to...?:
create an xmlx file on the fly (dynamically) in memory
serve the entire file to the client.
prevent server to hang out because of the long process running
This is a piece of the code I am using:
def reporte_usuarios(request):
from xlsxwriter.workbook import Workbook
try:
import cStringIO as StringIO
except ImportError:
import StringIO
# create a workbook in memory
output = StringIO.StringIO()
workbook = Workbook(output)
bold = workbook.add_format({'bold': True})
# get the data
from django.db.models import Count
usuarios = User.objects.filter(....... # all filter stuff
for usr in usuarios:
if usr.activos > 0:
# create a workbook sheet every User registered
ws = workbook.add_worksheet(u'%s' % usr.username)
# some relevant user data
ws.write(1, 1, u'USUARIO: %s' % usr.username)
...
# get rows for user
log = LogActivos.objects.filter(usuario=usr).select_related('activo__unidad__id', 'activo__unidad__nombre', 'activo__nombre')
# write headers
ws.write(3, 0, u'FECHA', bold)
...
sig_fila = 4 #starting row for data (after headers)
for l in log:
# write all data
ws.write(sig_fila, 0, u'%s' % l.fecha)
...
sig_fila += 1
# close the workbook
workbook.close()
# go to the beginning of the buffer
output.seek(0)
# response using the buffer
response = HttpResponse(output.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
response['Content-Disposition'] = 'attachment; filename="ACTIVOS_USUARIOS__%s.xlsx"' % datetime.now().strftime("%Y%m%d_%H%M")
return response
Notes: I am using Gunicorn on Heroku, django 1.9.13 and python 2.7.11
IMHO you should follow a totally different approach in this case.
As you are generating a file rather big in size, it's normal for the system to hang out because of a timeout error.
What you could do instead is to deploy a background task queue, like Celery or DjangoRQ. With that, you will get a background task to create this file using your user's data, and then you can let your user know that it's ready by any mean, like a notification or an email.
If you need more details regarding how you can do something like this, let me know and I can help :)

aiobotocore-aiohttp - Get S3 file content and stream it in the response

I want to get the content of an uploaded file on S3 using botocore and aiohttp service. As the files may have a huge size:
I don't want to store the whole file content in memory,
I want to be able to handle other requests while downloading files from S3 (aiobotocore, aiohttp),
I want to be able to apply modifications on the files I download, so I want to treat it line by line and stream the response to the client
For now, I have the following code in my aiohttp handler:
import asyncio
import aiobotocore
from aiohttp import web
#asyncio.coroutine
def handle_get_file(loop):
session = aiobotocore.get_session(loop=loop)
client = session.create_client(
service_name="s3",
region_name="",
aws_secret_access_key="",
aws_access_key_id="",
endpoint_url="http://s3:5000"
)
response = yield from client.get_object(
Bucket="mybucket",
Key="key",
)
Each time I read one line from the given file, I want to send the response. Actually, get_object() returns a dict with a Body (ClientResponseContentProxy object) inside. Using the method read(), how can I get a chunk of the expected response and stream it to the client ?
When I do :
for content in response['Body'].read(10):
print("----")
print(content)
The code inside the loop is never executed.
But when I do :
result = yield from response['Body'].read(10)
I get the content of the file in result. I am a little bit confused about how to use read() here.
Thanks
it's because the aiobotocore api is different than the one of botocore , here read() returns a FlowControlStreamReader.read generator for which you need to yield from
it looks something like that (taken from https://github.com/aio-libs/aiobotocore/pull/19)
resp = yield from s3.get_object(Bucket='mybucket', Key='k')
stream = resp['Body']
try:
chunk = yield from stream.read(10)
while len(chunk) > 0:
...
chunk = yield from stream.read(10)
finally:
stream.close()
and actually in your case you can even use readline()
https://github.com/KeepSafe/aiohttp/blob/c39355bef6c08ded5c80e4b1887e9b922bdda6ef/aiohttp/streams.py#L587

Requests with multiple connections

I use the Python Requests library to download a big file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The big file downloads at +- 30 Kb per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.
Is there a way to make multiple connections at the same time to download one file?
You can use HTTP Range header to fetch just part of file (already covered for python here).
Just start several threads and fetch different range with each and you're done ;)
def download(url,start):
req = urllib2.Request('http://www.python.org/')
req.headers['Range'] = 'bytes=%s-%s' % (start, start+chunk_size)
f = urllib2.urlopen(req)
parts[start] = f.read()
threads = []
parts = {}
# Initialize threads
for i in range(0,10):
t = threading.Thread(target=download, i*chunk_size)
t.start()
threads.append(t)
# Join threads back (order doesn't matter, you just want them all)
for i in threads:
i.join()
# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
Also note that not every server supports Range header (and especially servers with php scripts responsible for data fetching often don't implement handling of it).
Here's a Python script that saves given url to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool # use threads
from urllib2 import HTTPError, Request, urlopen
def download_chunk(url, byterange):
req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
try:
return urlopen(req).read()
except HTTPError as e:
return b'' if e.code == 416 else None # treat range error as EOF
except EnvironmentError:
return None
def main():
url, filename = sys.argv[1:]
pool = Pool(4) # define number of concurrent connections
chunksize = 1 << 16
ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
with open(filename, 'wb') as file:
for s in pool.imap(partial(download_part, url), ranges):
if not s:
break # error or EOF
file.write(s)
if len(s) != chunksize:
break # EOF (servers with no Range support end up here)
if __name__ == "__main__":
main()
The end of file is detected if a server returns empty body, or 416 http code, or if the response size is not chunksize exactly.
It supports servers that doesn't understand Range header (everything is downloaded in a single request in this case; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread instead of the file content itself).
It allows to change independently number of concurrent connections (pool size) and number of bytes requested in a single http request.
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)
This solution requires the linux utility named "aria2c", but it has the advantage of easily resuming downloads.
It also assumes that all the files you want to download are listed in the http directory list for location MY_HTTP_LOC. I tested this script on an instance of lighttpd/1.4.26 http server. But, you can easily modify this script so that it works for other setups.
#!/usr/bin/python
import os
import urllib
import re
import subprocess
MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"
# retrieve webpage source code
f = urllib.urlopen(MY_HTTP_LOC)
page = f.read()
f.close
# extract relevant URL segments from source code
rgxp = '(\<td\ class="n"\>\<a\ href=")([0-9a-zA-Z\(\)\-\_\.]+)(")'
results = re.findall(rgxp,str(page))
files = []
for match in results:
files.append(match[1])
# download (using aria2c) files
for afile in files:
if os.path.exists(afile) and not os.path.exists(afile+'.aria2'):
print 'Skipping already-retrieved file: ' + afile
else:
print 'Downloading file: ' + afile
subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC+str(afile)]).wait()
you could use a module called pySmartDLfor this it uses multiple threads and can do a lot more also this module gives a download bar by default.
for more info check this answer

Categories