For starters, I'm new to Python, so my code below may not be the cleanest. For a program I need to download about 500 webpages. The URLs are stored in an array which is populated by a previous function. The downloading part goes something like this:
import urllib

def downloadpages(num):
    for i in range(num):
        urllib.urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')
Each file is only around 20 KB, but it takes at least 10 minutes to download all of them. Downloading a single file of the total combined size would only take a minute or two. Is there a way I can speed this up? Thanks
Edit: To anyone who is interested: following the example at http://code.google.com/p/workerpool/wiki/MassDownloader and using 50 threads, the download time has been reduced to about 20 seconds from the original 10 minutes plus. The download time continues to decrease as threads are added, up to around 60 threads, after which it begins to rise again.
But you're not downloading a single file here. You're downloading 500 separate pages; each connection involves overhead (for the initial connection), plus whatever else the server is doing (is it serving other people?).
Either way, downloading 500 × 20 KB is not the same as downloading a single file of that size.
You can speed up execution significantly by using threads (be careful, though, not to overload the server); see the sketch after the links below.
Intro material/Code samples:
http://docs.python.org/library/threading.html
Python Package For Multi-Threaded Spider w/ Proxy Support?
http://code.google.com/p/workerpool/wiki/MassDownloader
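As a minimal sketch of the thread-pool idea (a hypothetical adaptation, not the workerpool example itself; it assumes Python 2 and the downloadlist and names arrays from the question):

import threading
import Queue
import urllib

def worker(q):
    # Each thread pulls (url, filename) pairs until the queue is empty.
    while True:
        try:
            url, filename = q.get_nowait()
        except Queue.Empty:
            return
        urllib.urlretrieve(url, filename)
        q.task_done()

q = Queue.Queue()
for url, name in zip(downloadlist, names):
    q.put((url, 'webpages/' + name + '.htm'))

for _ in range(50):  # ~50 threads worked well per the question's edit
    t = threading.Thread(target=worker, args=(q,))
    t.daemon = True
    t.start()

q.join()  # wait until every queued download has finished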
You can use greenlets to do this. For example, with the eventlet library:
import eventlet
from eventlet.green import urllib2

urls = [url1, url2, ...]

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
    print "got body", len(body)
All calls in the pool will be pseudo-simultaneous.
Of course, you must first install eventlet with pip or easy_install.
There are several greenlet implementations for Python; you could do the same with gevent or another one.
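For instance, a rough gevent equivalent might look like this (a sketch; it relies on gevent's monkey-patching to make urllib2 cooperative, and urls is the same list as above):

import gevent.monkey
gevent.monkey.patch_all()  # make sockets (and thus urllib2) cooperative

import urllib2
from gevent.pool import Pool

def fetch(url):
    return urllib2.urlopen(url).read()

pool = Pool(50)  # at most 50 concurrent green threads
for body in pool.imap(fetch, urls):
    print "got body", len(body)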
In addition to using concurrency of some sort, make sure whatever method you're using to make the requests uses HTTP/1.1 connection persistence. That will allow each thread to open only a single connection and request all the pages over it, instead of having a TCP/IP setup/teardown for each request. I'm not sure if urllib2 does that by default; you might have to roll your own.
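A sketch of what rolling your own could look like with the standard httplib module, which speaks HTTP/1.1 (the host and paths are placeholders):

import httplib

conn = httplib.HTTPConnection('example.com')  # one persistent connection
for path in ['/page1.htm', '/page2.htm']:     # placeholder paths
    conn.request('GET', path)
    resp = conn.getresponse()
    body = resp.read()  # the full body must be read before the next request
conn.close()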
I use pygdrive3 to connect to Google Drive. Is there any way, either in this package or in google-api-python-client, to get more files with one request? The files are relatively small, but I'd like to fetch 100 of them at once.
Is there any method for this?
I could of course call .files().get_media(fileId=...).execute() 100 times, but that's quite a slow execution.
What I have done in one of my projects is to set up a thread pool and let each of the threads start a request. To do so, try the following snippet (which you need to adapt to your use case):
from pathos.threading import ThreadPool as Pool

N = 10  # number of threads
my_pool = Pool(N)
my_pool.amap(<function>, <args>)  # fill in your download function and its arguments
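For the Drive case, that might look roughly like this (a sketch, not tested against the API; download is a hypothetical helper, and service is assumed to be an authenticated google-api-python-client Drive service):

from pathos.threading import ThreadPool as Pool

def download(file_id):
    # Hypothetical helper wrapping the call from the question.
    return service.files().get_media(fileId=file_id).execute()

pool = Pool(10)
results = pool.amap(download, file_ids)  # file_ids: your 100 IDs
bodies = results.get()                   # blocks until all downloads finish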
import requests

while True:
    try:
        posting = requests.post(url, json=data, headers=headers, timeout=3.05)
    except requests.exceptions.ConnectionError:
        continue
    # If a read timeout occurs, start from the beginning of the loop
    except requests.exceptions.ReadTimeout:
        continue
A link to more code: Multiple accidental POST requests in Python
This code uses the requests library to perform POST requests indefinitely. I noticed that when the try block fails multiple times and the while loop restarts over and over, once I can finally send the POST request I find multiple entries on the server side at the same second. I was writing to a txt file at the same time, and it showed only one entry. Each entry is 5 readings. Is this an issue with the library itself? Is there a way to fix this?! No matter what kind of conditions I put in, it still doesn't work :/ !
You can see that the reading at 12:11:13 has 6 parameters in that second, while at 12:14:30 (after the delay; it should be one reading every 10 seconds) there are several entries at the same second: 3 entries making up 18 readings in one second, instead of only 6!
It looks like the server receives your requests and acts upon them but fails to respond in time (3 s is a pretty low timeout; a load spike or paging operation can easily make the server miss it unless it employs special measures). I'd suggest that you:
process requests asynchronously (e.g. spawn threads; Asynchronous Requests with Python requests discusses ways to do this with requests) and do not use timeouts (TCP has its own timeouts; let it fail instead).
reuse the connection(s) (TCP has quite a bit of overhead for establishing/tearing down connections) or use UDP instead.
include some "hints" (IDs, timestamps, etc.) so the server can reject duplicate records; see the sketch below. (I'd call this one a workaround, as the real problem is that you're not making sure your request was processed.)
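A minimal sketch of the "hints" idea (it assumes the server is taught to ignore a request_id it has already seen; the field name is made up):

import uuid
import requests

payload = dict(data)
payload['request_id'] = str(uuid.uuid4())  # generated once, reused across retries

while True:
    try:
        requests.post(url, json=payload, headers=headers)
        break  # stop retrying once the POST goes through
    except requests.exceptions.ConnectionError:
        continue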
From the server side, you may want to:
Respond ASAP and act upon the info later. Do not let a pending action prevent answering further requests.
I have a requirement where I need to hit up to 2000 URLs per minute and save the responses to a database. The URLs need to be hit within 5 seconds of the start of every minute (but the response can wait). Then, at the next minute, the same happens again, and so on. So it's time-critical.
I've tried using Python multiprocessing and threading to solve the problem. However, some URLs may take up to 30 minutes to respond, which blocks all other URLs from being processed.
I'm also open to using something lower level such as C, but don't know where to start.
Any guidance in the right direction will help, thanks.
You need something lighter than a thread, since if each URL can block for a long time, you'll need to send them all simultaneously instead of via a thread pool.
gevent is a Python wrapper around the libevent/libev event loop and is good at this sort of thing. From their docs:
>>> import gevent
>>> from gevent import socket
>>> urls = ['www.google.com', 'www.example.com', 'www.python.org']
>>> jobs = [gevent.spawn(socket.gethostbyname, url) for url in urls]
>>> gevent.joinall(jobs, timeout=2)
>>> [job.value for job in jobs]
['74.125.79.106', '208.77.188.166', '82.94.164.162']
I am not sure if I have understood the problem correctly, but if you are using 'n' processes and all 'n' of them get stuck waiting on a response, then changing the language will not solve your issue: the bottleneck is the server you are requesting, not your local driver code. You can eliminate this dependency by switching to an asynchronous mechanism. Do not wait for the response; let a callback handle it for you!
EDIT: You might want to have a look at https://github.com/kennethreitz/grequests
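A rough sketch of what that could look like (untested; save_to_db is a placeholder for your database writer, and the size argument caps concurrency so 2000 URLs don't open 2000 sockets at once):

import grequests

reqs = (grequests.get(u) for u in urls)    # urls: this minute's ~2000 URLs
responses = grequests.map(reqs, size=200)  # sent concurrently on greenlets
for r in responses:
    if r is not None:                      # None means the request failed
        save_to_db(r.url, r.content)       # save_to_db: your DB writer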
I have a simple Python script which checks a few URLs:
f = urllib2.urlopen(urllib2.Request(url))
As I have the socket timeout set to 5 seconds, it is sometimes bothersome to wait 5 s × (number of URLs) for the results.
Is there any easy, standardized way to run those URL checks asynchronously without big overhead? The script must use standard Python components on a vanilla Ubuntu distribution (no additional installations).
Any ideas?
I wrote something called multibench a long time ago. I used it for almost the same thing you want to do here, which was to call multiple concurrent instances of wget and see how long they take to complete. It is a crude load-testing and performance-monitoring tool. You will need to adapt it somewhat, because it runs the same command n times.
Install the additional software. It's a waste of time to re-invent something just because of packaging decisions made by someone else.
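That said, if installing anything really is off the table, a standard-library-only sketch might look like this (Python 2, matching the question's urllib2 call; the total wait becomes roughly the slowest URL rather than the sum):

import threading
import urllib2

def check(url, results):
    # Record the HTTP status code, or the exception if the check fails.
    try:
        f = urllib2.urlopen(urllib2.Request(url), timeout=5)
        results[url] = f.getcode()
    except Exception as e:
        results[url] = e

results = {}
threads = [threading.Thread(target=check, args=(u, results)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()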
I'm building a download manager in Python for fun, and sometimes the connection to the server stays open but the server doesn't send me any data, so the read method (of HTTPResponse) blocks forever. This happens, for example, when I download from a server located outside of my country that limits the bandwidth to other countries.
How can I set a timeout for the read method (2 minutes, for example)?
Thanks, Nir.
If you're stuck on some Python version < 2.6, one (imperfect but usable) approach is to do
import socket
socket.setdefaulttimeout(10.0) # or whatever
before you start using httplib. The docs are here, and clearly state that setdefaulttimeout is available since Python 2.3 -- every socket made from the time you do this call, to the time you call the same function again, will use that timeout of 10 seconds. You can use getdefaulttimeout before setting a new timeout, if you want to save the previous timeout (including none) so that you can restore it later (with another setdefaulttimeout).
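For example, the save-and-restore idiom could look like this (a sketch; the 10-second value is arbitrary):

import socket

old = socket.getdefaulttimeout()   # may be None
socket.setdefaulttimeout(10.0)
try:
    pass  # make your httplib/urllib2 calls here
finally:
    socket.setdefaulttimeout(old)  # restore the previous default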
These functions and idioms are quite useful whenever you need to use some older higher-level library which uses Python sockets but doesn't give you a good way to set timeouts (of course it's better to use updated higher-level libraries, e.g. the httplib version that comes with 2.6 or the third-party httplib2 in this case, but that's not always feasible, and playing with the default timeout setting can be a good workaround).
You have to set it during HTTPConnection initialization.
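For example (the timeout parameter is available from Python 2.6 on; the host and path are placeholders):

import httplib

conn = httplib.HTTPConnection('example.com', timeout=120)  # 2 minutes, per the question
conn.request('GET', '/some/large/file')
resp = conn.getresponse()
data = resp.read()  # raises socket.timeout if the server stalls for 120 s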
Note: in case you are using an older version of Python, you can install httplib2; many consider it a superior alternative to httplib, and it does support timeouts.
I've never used it, though, and I'm just reporting what documentation and blogs are saying.
Setting the default timeout might abort a download early if it's large, as opposed to aborting only if it stops receiving data for the timeout value. httplib2 is probably the way to go.
5 years later but hopefully this will help someone else...
I was racking my brain trying to figure this out. My problem was a server returning corrupt content and thus giving back less data than it claimed to have.
I came up with a nasty solution that seems to be working properly. Here it goes:
# NOTE: directly disabling blocking is not necessary, but it represents
# an important piece of the problem, so I am leaving it here.
# http_response.fp._sock.socket.setblocking(0)
http_response.fp._sock.settimeout(read_timeout)
http_response.read(chunk_size)
NOTE: This solution also works for the Python requests library, and for ANY library that uses normal Python sockets (which should be all of them?). You just have to go a few levels deeper:
resp.raw._fp.fp._sock.socket.setblocking(0)  # optional, as noted above
resp.raw._fp.fp._sock.settimeout(read_timeout)
resp.raw.read(chunk_size)
As of this writing, I have not tried the following, but in theory it should work:
resp = requests.get(some_url, stream=True)
resp.raw._fp.fp._sock.socket.setblocking(0)  # optional, as noted above
resp.raw._fp.fp._sock.settimeout(read_timeout)
for chunk in resp.iter_content(chunk_size):
    # do stuff
Explanation
I stumbled upon this approach while reading this SO question about setting a timeout on socket.recv.
At the end of the day, any HTTP request has a socket. For httplib that socket is located at resp.raw._fp.fp._sock.socket. resp.raw._fp.fp._sock is a socket._fileobject (which I honestly didn't look far into), and I imagine its settimeout method internally sets the timeout on the socket attribute.