My task is to download 1M+ images from a given list of urls. What is the recommended way to do so?
After reading "Greenlet Vs. Threads" I looked into gevent, but I can't get it to run reliably. With a test set of 100 URLs it sometimes finishes in 1.5 s and sometimes takes over 30 s, which is strange: the timeout per request is 0.1 s (see the code below), so it should never take more than 10 s.
I also looked into grequests, but it seems to have issues with exception handling.
My 'requirements' are that I can
inspect the errors raised while downloading (timeouts, corrupt images...),
monitor the progress of the number of processed images and
be as fast as possible.
from gevent import monkey; monkey.patch_all()
from time import time
import requests
from PIL import Image
import cStringIO
import gevent.hub

POOL_SIZE = 300

def download_image_wrapper(task):
    return download_image(task[0], task[1])

def download_image(image_url, download_path):
    raw_binary_request = requests.get(image_url, timeout=0.1).content
    image = Image.open(cStringIO.StringIO(raw_binary_request))
    image.save(download_path)

def download_images_gevent_spawn(list_of_image_urls, base_folder):
    download_paths = ['/'.join([base_folder, url.split('/')[-1]])
                      for url in list_of_image_urls]
    parameters = [[image_url, download_path] for image_url, download_path in
                  zip(list_of_image_urls, download_paths)]
    tasks = [gevent.spawn(download_image_wrapper, parameter_tuple) for parameter_tuple in parameters]
    for task in tasks:
        try:
            task.get()
        except Exception:
            print 'x',
            continue
        print '.',

test_urls = # list of 100 urls
t1 = time()
download_images_gevent_spawn(test_urls, 'download_temp')
print time() - t1
I think it would be better to stick with urllib2, following the example at https://github.com/gevent/gevent/blob/master/examples/concurrent_download.py#L1
Try this code; I believe it is what you're asking for.
import gevent
from gevent import monkey
# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()
import sys

# chloya_files is not defined in this snippet; substitute your own list of URLs
urls = sorted(chloya_files)

if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen

def download_file(url):
    data = urlopen(url).read()
    img_name = url.split('/')[-1]
    with open('c:/temp/img/'+img_name, 'wb') as f:
        f.write(data)
    return True

from time import time
t1 = time()
tasks = [gevent.spawn(download_file, url) for url in urls]
gevent.joinall(tasks, timeout = 12.0)
# print(...) with a single argument works on both Python 2 and 3
print("Successful: %s from %s" % (sum(1 if task.value else 0 for task in tasks), len(tasks)))
print(time() - t1)
There's a simple solution using gevent and Requests: simple-requests.
Use a Requests Session for persistent HTTP connections. Since gevent makes Requests asynchronous, I think there's no need for a timeout in HTTP requests.
By default, requests.Session caches connection pools for 10 hosts (pool_connections) and keeps at most 10 connections in each pool (pool_maxsize). Tweak these defaults to suit your needs by explicitly creating an HTTP adapter.
session = requests.Session()
http_adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount('http://', http_adapter)
Break the work into producer and consumer tasks: image downloading is the producer task and image processing is the consumer task.
If the image processing library (PIL) is not asynchronous, it may block the producer coroutines. If so, the consumer pool can be a gevent.threadpool.ThreadPool, for example:
from gevent.threadpool import ThreadPool
consumer = ThreadPool(POOL_SIZE)
This is an overview of how it can be done. I didn't test the code.
from gevent import monkey; monkey.patch_all()
from time import time
import requests
from PIL import Image
from io import BytesIO
import os
from urlparse import urlparse
from gevent.pool import Pool

def download(url):
    try:
        response = session.get(url)
    except Exception as e:
        print(e)
    else:
        if response.status_code == requests.codes.ok:
            file_name = urlparse(url).path.rsplit('/',1)[-1]
            return (response.content,file_name)
        response.raise_for_status()

def process(img):
    if img is None:
        return None
    img, name = img
    img = Image.open(BytesIO(img))
    path = os.path.join(base_folder, name)
    try:
        img.save(path)
    except Exception as e:
        print(e)
    else:
        return True

def run(urls):
    consumer.map(process, producer.imap_unordered(download, urls))

if __name__ == '__main__':
    POOL_SIZE = 300
    producer = Pool(POOL_SIZE)
    consumer = Pool(POOL_SIZE)
    session = requests.Session()
    http_adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
    session.mount('http://', http_adapter)
    test_urls = # list of 100 urls
    base_folder = 'download_temp'
    t1 = time()
    run(test_urls)
    print time() - t1
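For comparison, here is a minimal standard-library sketch (Python 3) of the two requirements from the question that the code above only handles by printing: keeping the exception for every failed URL and reporting progress. It is not the gevent approach described in this answer. It uses a plain thread pool and runs the download and the processing step in one job per URL. The names download_all, fake_fetch and fake_process are made up for illustration; the fake functions stand in for a real HTTP request and a real image save.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, fetch, process, max_workers=8, report_every=100):
    """Run fetch(url) then process(url, data) for every URL on a thread pool.

    Returns (done, errors): done counts the URLs that succeeded,
    errors maps each failed URL to the exception it raised.
    """
    def job(url):
        data = fetch(url)           # download step
        return process(url, data)   # decode / save step

    errors = {}
    done = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(job, url): url for url in urls}
        for count, future in enumerate(as_completed(futures), start=1):
            url = futures[future]
            try:
                future.result()     # re-raises whatever job() raised
                done += 1
            except Exception as exc:
                errors[url] = exc   # keep the exception for later inspection
            if count % report_every == 0:
                print("processed %d/%d" % (count, len(futures)))
    return done, errors


# Hypothetical stand-ins for a real download and a real image save.
def fake_fetch(url):
    if "bad" in url:
        raise IOError("timeout fetching " + url)
    return b"image-bytes"

def fake_process(url, data):
    return len(data)

done, errors = download_all(
    ["http://example.com/a.jpg", "http://example.com/bad.jpg"],
    fake_fetch, fake_process, report_every=1)
```

For a list of a million URLs it would probably be better to feed the pool in batches rather than submit every future up front, since each pending future holds memory.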
I suggest taking a look at Grablib: http://grablib.org/
It is an asynchronous parser based on pycurl and multicurl.
It also tries to handle network errors automatically (for example, retrying after a timeout).
I believe the Grab:Spider module will solve 99% of your problems.
http://docs.grablib.org/en/latest/index.html#spider-toc
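To illustrate the retry-on-timeout idea mentioned above, here is a small standard-library sketch. It is not Grab's API and makes no claim about how Grab implements retries. with_retries and flaky are hypothetical names, and flaky stands in for a network call that times out twice before succeeding.

```python
import time

def with_retries(func, attempts=3, delay=0.5, retry_on=(IOError,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise               # out of attempts: let the caller see the error
            time.sleep(delay)


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("timed out")
    return "ok"

result = with_retries(flaky, attempts=3, delay=0.01)
```

Only the exception types listed in retry_on are retried, so other errors (a corrupt image, for example) still surface immediately.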
Related
I have a very powerful CPU, plenty of RAM and a 1 Gbit/s internet connection, but the code below uses only 1% of my resources.
I am trying the code below, but the requests are not asynchronous. The first 1000 requests happen quickly, but after that it starts to slow down.
import requests
from concurrent.futures import ThreadPoolExecutor
import time

list_of_urls = []
with open("urllist.txt","r") as readurl:
    for i in readurl.readlines():
        list_of_urls.append("http://"+i.replace("\n","").strip())

def get_url(url):
    try:
        req =requests.get(url,timeout=4)
        req.close()
        return req
    except:
        return "TIMEOUT ERR"

start_time = time.time()
with ThreadPoolExecutor(max_workers=int(len(list_of_urls) / 2)) as pool:
    list(pool.map(get_url, list_of_urls))
print((start_time - time.time()) * -1)
Is it possible to send async requests to 100,000 websites in 1 second?
I'm using threads to download images from the ImageNet database.
Here is the link:
http://www.image-net.org/
First, I searched the ImageNet database for "puppies" and got a text file with 1000+ URLs (URLs for images).
(If you don't want to go that route, I've uploaded the first 400 or so URLs to this pastebin: https://pastebin.com/yTcHq0iw )
I then read the first 200 lines of the text file (thus 200 URLs) and used threads to download those 200 images.
If I download only 100 images (read only the first 100 URLs), the threads execute perfectly.
However, if I try something like 150+ (150, 175, 200, etc.), the threads download the first 147 (or 172ish if I'm using 175) and then hang for about 30 seconds before finishing the last few images.
I'm using Requests to download the images, so I'm not sure whether Requests is having trouble making some connections and that is the cause. I'm not too familiar with the lower-level API of Requests, so I'm not sure how to fix it if it is a Requests problem. I found some code on the internet and tried to override some of Requests' built-in options, but these tweaks haven't solved the "hanging" problem.
Here is my code:
import os
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import time
import queue
from threading import Thread

SAVE_DIR = r'C:\Users\Moondra\Desktop\TEMP\Puppy_threading' #is a constant,

def decorator_function(func):
    def wrapper(*args,**kwargs):
        session = requests.Session()
        retry = Retry(connect=0, backoff_factor=0.2)
        adapter = HTTPAdapter(max_retries=retry)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        return func(*args, session = session, **kwargs)
    return wrapper

#Using threading:
image_count = 0

#@decorator_function (optional decorator_function)
def download_image(session = None):
    global image_count
    if not session:
        session = requests.Session()
    while not q.empty():
        try:
            r = session.get(q.get(block = False))
        except (requests.exceptions.RequestException, UnicodeError) as e:
            print(e)
            image_count += 1
            q.task_done()
            continue
        image_count += 1
        q.task_done()
        print('image', image_count)
        with open(os.path.join(
                SAVE_DIR, 'image_{}.jpg'.format(image_count)),
                'wb') as f:
            f.write(r.content)

q =queue.Queue()
with open(r'C:\Users\Moondra\Desktop\puppies.txt', 'rt') as f:
    for i in range(200):
        line = f.readline()
        q.put(line.strip())
print(q.qsize())

threads = []
start = time.time()
for i in range(50):
    t = Thread(target = download_image)
    t.setDaemon(True)
    threads.append(t)
    t.start()
q.join()
for t in threads:
    t.join()
    print(t.name, 'has joined')
end = time.time()
print('time taken: {:.4f}'.format(end - start))
I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request).
The API places a limit of 1 request per second.
I would like to accomplish this on a laptop with a 100MB/sec internet connection.
Currently, I am accomplishing this (synchronously & too slowly) by:
-pre-computing all of the (encoded) URLs I want to scrape
-using Python 3's requests library to request each URL and save the resulting JSON one-by-one in separate .txt files.
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
#for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'w') as outfile:
        json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straightforward way to accomplish this asynchronously while respecting the 1-request-per-second threshold? I have read about grequests (no longer maintained?), twisted, asyncio, etc., but do not have enough experience to know whether one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hangup is, how can I do an overall rate-limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers 'wake up' and request at the same time. When I tested it, all workers grab a page and break the rate limit, then go to sleep simultaneously).
from datetime import datetime
from datetime import timedelta
from tornado import httpclient, gen, ioloop, queues

URLS = ["https://baconipsum.com/api/?type=meat",
        "https://baconipsum.com/api/?type=filler",
        "https://baconipsum.com/api/?type=meat-and-filler",
        "https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]
concurrency = 2

def handle_request(response):
    if response.code == 200:
        with open("FOO"+'.txt', "wb") as thisfile:#fix filenames to avoid overwrite
            thisfile.write(response.body)

@gen.coroutine
def request_and_save_url(url):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
        print('fetched {0}'.format(url))
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])

@gen.coroutine
def main():
    q = queues.Queue()
    tstart = datetime.now()
    fetching, fetched = set(), set()

    @gen.coroutine
    def fetch_url(worker_id):
        current_url = yield q.get()
        try:
            if current_url in fetching:
                return
            #print('fetching {0}'.format(current_url))
            print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now()-tstart).seconds ))
            fetching.add(current_url)
            yield request_and_save_url(current_url)
            fetched.add(current_url)
        finally:
            q.task_done()

    @gen.coroutine
    def worker(worker_id):
        while True:
            yield fetch_url(worker_id)

    # Fill a queue of URL's to scrape
    list = [q.put(url) for url in URLS] # this does not make a list...it just puts all the URLS into the Queue
    # Start workers, then wait for the work Queue to be empty.
    for ii in range(concurrency):
        worker(ii)
    yield q.join(timeout=timedelta(seconds=300))
    assert fetching == fetched
    print('Done in {0} seconds, fetched {1} URLs.'.format(
        datetime.now() - tstart, len(fetched)))

if __name__ == '__main__':
    import logging
    logging.basicConfig()
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    # .content is bytes, so the file must be opened in binary mode on Python 3
    with open('json_output.txt', 'wb') as outfile:
        outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
Tornado has a very powerful asynchronous client. Here's some basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado.gen
import tornado.ioloop

URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()

def handle_request(response):
    if response.code == 200:
        # response.body is bytes, so append in binary mode
        with open('json_output.txt', 'ab') as outfile:
            outfile.write(response.body)

@tornado.gen.coroutine
def queue_requests():
    results = []
    for url in URLS:
        nxt = tornado.gen.sleep(1) # 1 request per second
        res = http_client.fetch(url, handle_request)
        results.append(res)
        yield nxt
    yield results # wait for all requests to finish
    loop.add_callback(loop.stop)

loop.add_callback(queue_requests)
loop.start()
This is a straightforward approach that may open too many connections to the remote server. You may have to solve that problem by using a sliding window while queuing the requests.
If you need request timeouts or specific headers, feel free to read the docs.
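The question's follow-up asks how to enforce one request per second globally across several workers. A minimal sketch of one way to do that is below. It uses the standard library's asyncio rather than Tornado so that it runs without extra dependencies. RateLimiter, crawl and fake_fetch are hypothetical names, and fake_fetch stands in for a real HTTP request. Every worker asks one shared limiter for a start slot, so the slots are spaced by the interval no matter how many workers are running.

```python
import asyncio
import time

class RateLimiter:
    """Hands out start slots at least `interval` seconds apart, shared by all workers."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self):
        async with self._lock:                  # one worker at a time picks a slot
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        await asyncio.sleep(slot - now)         # sleep outside the lock
        return slot                             # the scheduled start time


async def worker(queue, limiter, fetch, slots):
    while True:
        url = await queue.get()
        try:
            slots.append(await limiter.wait())  # global limit, not per worker
            await fetch(url)
        finally:
            queue.task_done()


async def crawl(urls, fetch, interval=1.0, concurrency=3):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    limiter = RateLimiter(interval)
    slots = []                                  # scheduled start times, for checking
    workers = [asyncio.ensure_future(worker(queue, limiter, fetch, slots))
               for _ in range(concurrency)]
    await queue.join()                          # wait until every URL is handled
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return slots


async def fake_fetch(url):                      # hypothetical stand-in for an HTTP call
    await asyncio.sleep(0.01)

slots = asyncio.run(crawl(["u%d" % i for i in range(5)], fake_fetch, interval=0.05))
```

With interval=1.0 this spaces request starts one second apart. Responses may still overlap if a download takes longer than the interval, which is where the concurrency setting acts as the cap.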
I wrote a simple server and it runs well.
Now I want to write some code that makes many POST requests to my server simultaneously, to simulate a stress test. I use Python.
Suppose the URL of my server is http://myserver.com.
file1.jpg and file2.jpg are the files that need to be uploaded to the server.
Here is my testing code. I use threading and urllib2.
async_posts.py
from Queue import Queue
from threading import Thread
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2, sys, time

num_thread = 4
queue = Queue(2*num_thread)

def make_post(url):
    register_openers()
    data = {"file1": open("path/to/file1.jpg"), "file2": open("path/to/file2.jpg")}
    datagen, headers = multipart_encode(data)
    request = urllib2.Request(url, datagen, headers)
    start = time.time()
    res = urllib2.urlopen(request)
    end = time.time()
    return res.code, end - start # Return the status code and duration of this request.

def daemon():
    while True:
        url = queue.get()
        status, duration = make_post(url)
        print status, duration
        queue.task_done()

for _ in range(num_thread):
    thd = Thread(target = daemon)
    thd.daemon = True
    thd.start()

try:
    urls = ["http://myserver.com"] * num_thread
    for url in urls:
        queue.put(url)
    queue.join()
except KeyboardInterrupt:
    sys.exit(1)
When num_thread is small (e.g. 4), my code runs smoothly. But when I switch num_thread to a slightly larger number, say 10, the threads break down and keep throwing httplib.BadStatusLine errors.
I don't know why my code goes wrong. Is there a better way to do this?
For reference, my server is written in Python using Flask and gunicorn.
Thanks in advance.
The code below is an HTTP proxy for content filtering. It uses GET to send the URL of the current site to the server, which processes it and responds. It runs VERY, VERY, VERY slowly. Any ideas on how to make it faster?
Here is the code:
from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest
from Tkinter import *
#import win32api
import urllib2
import urllib
import os
import sys
import webbrowser

cwd = os.path.abspath(sys.argv[0])[0]
proxies = {}
user = "zachb"

class BlockingProxyRequest(ProxyRequest):
    def process(self):
        params = {}
        params['Location']= self.uri
        params['User'] = user
        params = urllib.urlencode(params)
        req = urllib.urlopen("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params, proxies=proxies)
        resp = req.read()
        req.close()
        if resp == "allow":
            pass
        else:
            self.transport.write('''BLOCKED BY ADMIN!''')
            self.transport.loseConnection()
        ProxyRequest.process(self)

class BlockingProxy(Proxy):
    requestFactory = BlockingProxyRequest

factory = http.HTTPFactory()
factory.protocol = BlockingProxy
reactor.listenTCP(8000, factory)
reactor.run()
Anyone have any ideas on how to make this run faster? Or even a better way to write it?
The main cause of slowness in this proxy is probably these three lines:
req = urllib.urlopen("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params, proxies=proxies)
resp = req.read()
req.close()
A normal Twisted-based application is single threaded. You have to go out of your way to get threads involved. That means that whenever a request comes in, you are blocking the one and only processing thread on this HTTP request. No further requests are processed until this HTTP request completes.
Try using one of the APIs in twisted.web.client, (eg Agent or getPage). These APIs don't block, so your server will handle concurrent requests concurrently. This should translate into much smaller response times.
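The effect described above can be shown without Twisted. The sketch below uses the standard library's asyncio event loop as a stand-in for the Twisted reactor, so it is an analogy and not Twisted code. blocking_handler and cooperative_handler are hypothetical names: the first calls time.sleep to mimic the blocking urllib.urlopen(...).read() call, the second awaits asyncio.sleep to mimic a non-blocking client.

```python
import asyncio
import time

async def blocking_handler(i, log):
    log.append(("start", i))
    time.sleep(0.05)             # blocks the only thread, like a blocking HTTP read
    log.append(("end", i))

async def cooperative_handler(i, log):
    log.append(("start", i))
    await asyncio.sleep(0.05)    # yields to the loop, as a non-blocking client would
    log.append(("end", i))

async def serve(handler, n=3):
    """Run n 'requests' on one event loop and return the order of events."""
    log = []
    await asyncio.gather(*(handler(i, log) for i in range(n)))
    return log

blocking_log = asyncio.run(serve(blocking_handler))
cooperative_log = asyncio.run(serve(cooperative_handler))
print(blocking_log[:3])
print(cooperative_log[:3])
```

In the blocking run each request finishes before the next one starts, so the requests are handled strictly one after another. In the cooperative run all three requests start before any of them finishes.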