Scheduling periodic uploads, good practice? - python

I am sending a bunch of sensor readings from my Raspberry Pi to a hosted PHP + MySQL database over a 3G network.
To save bandwidth and energy, it was recommended that rather than uploading sensor readings every second, I should only upload periodically, say every 5 minutes. So I decided to collect the readings in JSON format to ease the uploading POST process:
>>> import json
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
I also set a timer using the timer2 module:
>>> timer2.apply_interval(msecs, fun, args, kwargs, priority=0)
Now fun is called every 5 * 60 * 1000 ms, i.e. every 5 minutes. In fun, I upload the payload and reset its contents for the next round of data collection:
>>> r = requests.post(url, data=json.dumps(payload))
The question:
Is it good practice to reset the contents of the payload variable from fun while the main thread is still adding data to it?
Is there a better way of doing this?

You should organize your program to model the data flow from the sensor to the database. So, you would have a single timer that calls the sampling method at the polling rate. Pass the result of the sampling method to a buffering method, and fill the buffer until the buffering period has elapsed. Finally, post the buffer each time it fills up.
import json
import random
import time

import requests

def generate_samples():
    while True:
        input('Press ENTER to simulate next sample...')
        yield random.random()

def period_elapsed(period_start, period_duration):
    return period_start + period_duration < time.time()

def collect_buffers(samples, buffer_period):
    buffer_time = time.time()
    buffer_samples = []
    for sample in samples:
        buffer_samples.append(sample)
        if period_elapsed(buffer_time, buffer_period):
            yield buffer_samples
            buffer_samples = []
            buffer_time = time.time()

def post_buffers(buffers, url):
    for b in buffers:
        requests.post(url, data=json.dumps(b))

post_buffers(collect_buffers(generate_samples(), 300), 'http://localhost:3000')
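If you do keep the question's design of a shared payload dict that a timer callback uploads, a minimal thread-safety sketch (the names and URL below are illustrative, not from the original code) is to guard the buffer with a lock and swap it out atomically, so the uploader never resets a dict the collector is still filling:

import json
import threading

import requests

URL = 'https://example.com/upload'  # placeholder endpoint
payload = {}
payload_lock = threading.Lock()

def add_reading(key, value):
    # called from the main collection loop
    with payload_lock:
        payload[key] = value

def upload():
    # called by the periodic timer
    global payload
    with payload_lock:
        batch, payload = payload, {}   # swap: the collector now fills a fresh dict
    if batch:
        requests.post(URL, data=json.dumps(batch))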

Related

What are the things I should avoid in my python Cloud Function to avoid memory leak?

Generic question
My Python Cloud Function raises about 0.05 memory errors per second, while being invoked about 150 times per second. I get the feeling my function leaves memory residue behind, which causes its instances to crash once they have handled many requests. What should you do, or avoid doing, so that a function instance doesn't eat a bit more of its allocated memory each time it's called? I've been pointed to the docs and learned that I should delete all temporary files, since writing them consumes memory, but I don't think I've written any.
More context
The code of my function can be summed up as the following.
Global context: Grab a file on Google Cloud Storage containing a list of known User-Agents of bots. Instantiate an Error Reporting client.
If the User-Agent identifies a bot, return a 200 code. Otherwise, parse the request arguments, rename and format them, and timestamp the reception of the request.
Send the resulting message to Pub/Sub in a JSON string.
Return a 200 code
I believe my instances are gradually consuming all the memory available, based on this graph I made in Stackdriver:
It is a heat map of memory usage across my Cloud Function's instances, with red and yellow indicating that most of my function instances are consuming that range of memory. Because of the cycle that seems to appear, I interpret it as a gradual fill-up of my instances' memory until they crash and new instances are spawned. The cycle remains if I raise the memory allocated to the function; it just raises the upper bound of memory usage that the cycle follows.
Edit: Code excerpt and more context
The requests contain parameters that help implement tracking on an e-commerce website. Now that I'm copying it here, I notice there might be an anti-pattern where I modify form['products'] while iterating over it, but I don't think that has anything to do with wasted memory?
from json import dumps
from datetime import datetime
from pytz import timezone
from google.cloud import storage
from google.cloud import pubsub
from google.cloud import error_reporting
from unidecode import unidecode
# this is done in global context because I only want to load the BOTS_LIST at
# cold start
PROJECT_ID = '...'
TOPIC_NAME = '...'
BUCKET_NAME = '...'
BOTS_PATH = '.../bots.txt'
gcs_client = storage.Client()
cf_bucket = gcs_client.bucket(BUCKET_NAME)
bots_blob = cf_bucket.blob(BOTS_PATH)
BOTS_LIST = bots_blob.download_as_string().decode('utf-8').split('\r\n')
del cf_bucket
del gcs_client
del bots_blob
err_client = error_reporting.Client()
def detect_nb_products(parameters):
    '''
    Detects number of products in the fields of the request.
    '''
    # ...


def remove_accents(d):
    '''
    Takes a dictionary and recursively transforms its strings into ASCII
    encodable ones
    '''
    # ...


def safe_float_int(x):
    '''
    Custom converter to float / int
    '''
    # ...


def build_hit_id(d):
    '''concatenate specific parameters from a dictionary'''
    # ...


def cloud_function(request):
    """Actual Cloud Function"""
    try:
        time_received = datetime.now().timestamp()
        # filtering bots
        user_agent = request.headers.get('User-Agent')
        if all([bot not in user_agent for bot in BOTS_LIST]):
            form = request.form.to_dict()

            # setting the products field
            nb_prods = detect_nb_products(form.keys())
            if nb_prods:
                form['products'] = [{'product_name': form['product_name%d' % i],
                                     'product_price': form['product_price%d' % i],
                                     'product_id': form['product_id%d' % i],
                                     'product_quantity': form['product_quantity%d' % i]}
                                    for i in range(1, nb_prods + 1)]

            useful_fields = []  # list of keys I'll keep from the form
            unwanted = set(form.keys()) - set(useful_fields)
            for key in unwanted:
                del form[key]

            # float conversion
            if nb_prods:
                for prod in form['products']:
                    prod['product_price'] = safe_float_int(
                        prod['product_price'])

            # adding timestamp/hour/minute, user agent and date to the hit
            form['time'] = int(time_received)
            form['user_agent'] = user_agent
            dt = datetime.fromtimestamp(time_received)
            form['date'] = dt.strftime('%Y-%m-%d')
            remove_accents(form)

            friendly_names = {}  # dict to translate the keys I originally
                                 # receive to human friendly ones
            new_form = {}
            for key in form.keys():
                if key in friendly_names.keys():
                    new_form[friendly_names[key]] = form[key]
                else:
                    new_form[key] = form[key]
            form = new_form
            del new_form

            # logging
            print(form)

            # setting up Pub/Sub
            publisher = pubsub.PublisherClient()
            topic_path = publisher.topic_path(PROJECT_ID, TOPIC_NAME)

            # sending
            hit_id = build_hit_id(form)
            message_future = publisher.publish(topic_path,
                                               dumps(form).encode('utf-8'),
                                               time=str(int(time_received * 1000)),
                                               hit_id=hit_id)
            print(message_future.result())
            return ('OK',
                    200,
                    {'Access-Control-Allow-Origin': '*'})
        else:
            # do nothing for bots
            return ('OK',
                    200,
                    {'Access-Control-Allow-Origin': '*'})
    except KeyError:
        err_client.report_exception()
        return ('err',
                200,
                {'Access-Control-Allow-Origin': '*'})
There are a few things you could try (a theoretical answer, I haven't played with CFs yet):
explicitly delete the temporary variables that you allocate on the bot-processing path, which may be referencing each other and thus preventing the memory garbage collector from freeing them (see https://stackoverflow.com/a/33091796/4495081): nb_prods, unwanted, form, new_form, friendly_names, for example.
if unwanted is always the same, make it a global instead.
delete form before re-assigning it to new_form (otherwise the old form object remains alive); deleting new_form afterwards won't actually save much, since the object remains referenced by form. I.e. change:
form = new_form
del new_form
into
del form
form = new_form
explicitly invoke the memory garbage collector after you publish your topic and before returning. I'm unsure whether that's applicable to CFs or whether the invocation is immediately effective (for example in GAE it is not, see When will memory get freed after completing the request on App Engine Backend Instances?). This may also be overkill and could potentially hurt your CF's performance, so see if/how it works for you. A combined sketch follows the snippet below.
gc.collect()
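For illustration, a minimal, self-contained sketch of how the last two suggestions could be combined (the dicts below are placeholders, not the real request payload):

import gc

# hypothetical, simplified stand-ins for the request payload and key mapping
form = {'product_name1': 'shoes', 'product_price1': '10.0'}
friendly_names = {'product_name1': 'name', 'product_price1': 'price'}

new_form = {friendly_names.get(key, key): value for key, value in form.items()}
del form              # drop the old dict *before* rebinding, per the advice above
form = new_form

# ... publish `form` to Pub/Sub here ...

del form, new_form    # drop the large per-request objects
gc.collect()          # optionally force a collection pass before returning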

How to download multiple files simultaneously and trigger specific actions for each one done?

I need help with a feature I'm trying to implement; unfortunately I'm not very comfortable with multithreading.
My script downloads 4 different files from the internet, calls a dedicated function for each one, then saves them all.
The problem is that I'm doing it step by step, so I have to wait for each download to finish before proceeding to the next one.
I see what I should do to solve this, but I haven't managed to code it.
Actual Behaviour:
url_list = [Url1, Url2, Url3, Url4]
files_list = []
files_list.append(downloadFile(Url1))
handleFile(files_list[-1], type=0)
...
files_list.append(downloadFile(Url4))
handleFile(files_list[-1], type=3)
saveAll(files_list)
Needed Behaviour:
url_list = [Url1, Url2, Url3, Url4]
files_list = []
for url in url_list:
    callThread(files_list.append(downloadFile(url)),   # function
               handleFile(files_list[url.index], type=url.index))  # trigger
# use a thread for downloading
# once a file is downloaded, it triggers its associated function
# wait for all files to be treated
saveAll(files_list)
Thanks for your help !
A typical approach is to put the IO-heavy part, such as fetching data over the internet, and the data processing into the same function:
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor
import requests
def fetch_and_process_file(url):
    thread_name = threading.currentThread().name
    print(thread_name, "fetch", url)
    data = requests.get(url).text

    # "process" result
    time.sleep(random.random() / 4)  # simulate work
    print(thread_name, "process data from", url)
    result = len(data) ** 2
    return result

threads = 2
urls = ["https://google.com", "https://python.org", "https://pypi.org"]

executor = ThreadPoolExecutor(max_workers=threads)
with executor:
    results = executor.map(fetch_and_process_file, urls)

print()
print("results:", list(results))
outputs:
ThreadPoolExecutor-0_0 fetch https://google.com
ThreadPoolExecutor-0_1 fetch https://python.org
ThreadPoolExecutor-0_0 process data from https://google.com
ThreadPoolExecutor-0_0 fetch https://pypi.org
ThreadPoolExecutor-0_0 process data from https://pypi.org
ThreadPoolExecutor-0_1 process data from https://python.org
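If each file really needs its own post-processing step, as in the question's needed behaviour, one possible variant (a sketch only; downloadFile, handleFile, saveAll and url_list are the question's own, undefined, names) is to pair every URL with its handler type and let each worker run "download then handle" for its pair:

from concurrent.futures import ThreadPoolExecutor

def download_and_handle(job):
    url, file_type = job
    f = downloadFile(url)          # blocking download runs in a worker thread
    handleFile(f, type=file_type)  # the per-file trigger fires as soon as its download ends
    return f

jobs = [(url, i) for i, url in enumerate(url_list)]  # (url, type) pairs: types 0..3

with ThreadPoolExecutor(max_workers=4) as executor:
    files_list = list(executor.map(download_and_handle, jobs))

saveAll(files_list)  # runs only after every file has been downloaded and handled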

How to speed up web scraping in python

I'm working on a project for school and I am trying to get data about movies. I've managed to write a script to get the data I need from IMDbPY and the Open Movie DB API (omdbapi.com). The challenge I'm experiencing is that I'm trying to get data for 22,305 movies and each request takes about 0.7 seconds, so my current script will take about 8 hours to complete. I'm looking for any way to make multiple requests at the same time, or any other suggestion to significantly speed up getting this data.
import urllib2
import json
import pandas as pd
import time
import imdb
start_time = time.time() #record time at beginning of script
#used to make imdb.com think we are getting this data from a browser
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
#Open Movie Database Query url for IMDb IDs
url = 'http://www.omdbapi.com/?tomatoes=true&i='
#read the ids from the imdb_id csv file
imdb_ids = pd.read_csv('ids.csv')
cols = [u'Plot', u'Rated', u'tomatoImage', u'Title', u'DVD', u'tomatoMeter',
u'Writer', u'tomatoUserRating', u'Production', u'Actors', u'tomatoFresh',
u'Type', u'imdbVotes', u'Website', u'tomatoConsensus', u'Poster', u'tomatoRotten',
u'Director', u'Released', u'tomatoUserReviews', u'Awards', u'Genre', u'tomatoUserMeter',
u'imdbRating', u'Language', u'Country', u'imdbpy_budget', u'BoxOffice', u'Runtime',
u'tomatoReviews', u'imdbID', u'Metascore', u'Response', u'tomatoRating', u'Year',
u'imdbpy_gross']
#create movies dataframe
movies = pd.DataFrame(columns=cols)
i=0
for i in range(len(imdb_ids) - 1):
    start = time.time()
    req = urllib2.Request(url + str(imdb_ids.ix[i, 0]), None, headers)  # request page
    response = urllib2.urlopen(req)  # actually call the html request
    the_page = response.read()  # read the json from the omdbapi query
    movie_json = json.loads(the_page)  # convert the json to a dict

    # get the gross revenue and budget from IMDbPy
    data = imdb.IMDb()
    movie_id = imdb_ids.ix[i, ['imdb_id']]
    movie_id = movie_id.to_string()
    movie_id = int(movie_id[-7:])
    data = data.get_movie_business(movie_id)
    data = data['data']
    data = data['business']

    # get the budget $ amount out of the budget IMDbPy string
    try:
        budget = data['budget']
        budget = budget[0]
        budget = budget.replace('$', '')
        budget = budget.replace(',', '')
        budget = budget.split(' ')
        budget = str(budget[0])
    except:
        None

    # get the gross $ amount out of the gross IMDbPy string
    try:
        budget = data['budget']
        budget = budget[0]
        budget = budget.replace('$', '')
        budget = budget.replace(',', '')
        budget = budget.split(' ')
        budget = str(budget[0])

        # get the gross $ amount out of the gross IMDbPy string
        gross = data['gross']
        gross = gross[0]
        gross = gross.replace('$', '')
        gross = gross.replace(',', '')
        gross = gross.split(' ')
        gross = str(gross[0])
    except:
        None

    # add gross to the movies dict
    try:
        movie_json[u'imdbpy_gross'] = gross
    except:
        movie_json[u'imdbpy_gross'] = 0
    # add budget to the movies dict
    try:
        movie_json[u'imdbpy_budget'] = budget
    except:
        movie_json[u'imdbpy_budget'] = 0

    # create new dataframe that can be merged to movies DF
    tempDF = pd.DataFrame.from_dict(movie_json, orient='index')
    tempDF = tempDF.T

    # add the new movie to the movies dataframe
    movies = movies.append(tempDF, ignore_index=True)

    end = time.time()
    time_took = round(end - start, 2)
    percentage = round(((i + 1) / float(len(imdb_ids))) * 100, 1)
    print i + 1, "of", len(imdb_ids), "(" + str(percentage) + '%)', 'completed', time_took, 'sec'

    # increment counter
    i += 1

# save the dataframe to a csv file
movies.to_csv('movie_data.csv', index=False)
end_time = time.time()
print round((end_time - start_time) / 60, 1), "min"
Use the Eventlet library to fetch concurrently
As advised in the comments, you should fetch your feeds concurrently. This can be done using threading, multiprocessing, or eventlet.
Install eventlet
$ pip install eventlet
Try web crawler sample from eventlet
See: http://eventlet.net/doc/examples.html#web-crawler
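For reference, a minimal sketch along the lines of that example (eventlet's green urllib2 plus a GreenPool; the URLs below are placeholders):

import eventlet
from eventlet.green import urllib2   # non-blocking drop-in for urllib2

urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "http://python.org/images/python-logo.gif"]

def fetch(url):
    return url, urllib2.urlopen(url).read()

pool = eventlet.GreenPool(200)             # up to 200 concurrent green threads
for url, body in pool.imap(fetch, urls):
    print "%s: %d bytes fetched" % (url, len(body))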
Understanding concurrency with eventlet
With threading, the system takes care of switching between your threads. This becomes a big problem when you have to access common data structures, since you never know which other thread is currently accessing your data. You then start playing with synchronized blocks, locks and semaphores, just to synchronize access to your shared data structures.
With eventlet it is much simpler: only one green thread runs at a time, and switching between them happens only at I/O instructions or other eventlet calls. The rest of your code runs uninterrupted and without the risk of another thread messing with your data.
You only have to take care of the following:
all I/O operations must be non-blocking (this is mostly easy, eventlet provides non-blocking versions of most of the I/O you need).
your remaining code must not be CPU-expensive, as that would block switching between "green" threads for too long and the power of "green" multithreading would be gone.
A great advantage of eventlet is that it allows you to write code in a straightforward way without spoiling it (too) much with locks, semaphores etc.
Apply eventlet to your code
If I understand it correctly, the list of urls to fetch is known in advance and the order of their processing in your analysis is not important. This allows an almost direct copy of the example from eventlet. I see that the index i has some significance, so you might consider combining the url and its index into a tuple and processing them as independent jobs.
There are definitely other methods, but personally I have found eventlet really easy to use compared to other techniques, while getting really good results (especially with fetching feeds). You just have to grasp the main concepts and be a bit careful to follow eventlet's requirements (keep being non-blocking).
Fetching urls using requests and eventlet - erequests
There are various packages for asynchronous processing with requests; one of them uses eventlet and is named erequests, see https://github.com/saghul/erequests
Simple sample fetching set of urls
import erequests
# have list of urls to fetch
urls = [
'http://www.heroku.com',
'http://python-tablib.org',
'http://httpbin.org',
'http://python-requests.org',
'http://kennethreitz.com'
]
# erequests.async.get(url) creates asynchronous request
async_reqs = [erequests.async.get(url) for url in urls]
# each async request is ready to go, but not yet performed
# erequests.map will call each async request to the action
# what returns processed request `req`
for req in erequests.map(async_reqs):
    if req.ok:
        content = req.content
        # process it here
        print "processing data from:", req.url
Problems for processing this specific question
We are able to fetch and somehow process all the urls we need. But in this question, processing is bound to a particular record in the source data, so we will need to match each processed request with the index of the record it belongs to in order to get further details for final processing.
As we will see later, asynchronous processing does not honour the order of requests: some are processed sooner, some later, and map yields whatever is completed.
One option is to attach the index of the given url to the request and use it later when processing the returned data.
Complex sample of fetching and processing urls with preserved url indices
Note: the following sample is rather complex; if you can live with the solution provided above, skip this. But make sure you do not run into the problems detected and resolved below (urls being modified, requests following redirects).
import erequests
from itertools import count, izip
from functools import partial
urls = [
'http://www.heroku.com',
'http://python-tablib.org',
'http://httpbin.org',
'http://python-requests.org',
'http://kennethreitz.com'
]
def print_url_index(index, req, *args, **kwargs):
    content_length = req.headers.get("content-length", None)
    todo = "PROCESS" if req.status_code == 200 else "WAIT, NOT YET READY"
    print "{todo}: index: {index}: status: {req.status_code}: length: {content_length}, {req.url}".format(**locals())

async_reqs = (erequests.async.get(url, hooks={"response": partial(print_url_index, i)})
              for i, url in izip(count(), urls))

for req in erequests.map(async_reqs):
    pass
Attaching hooks to a request
requests (and erequests too) allows defining hooks for the event called response. Each time a request gets a response, the hook function is called and can do something, or even modify the response.
The following line defines a hook on the response:
erequests.async.get(url, hooks={"response": partial(print_url_index, i)})
Passing the url index to the hook function
The signature of any hook shall be func(req, *args, **kwargs)
But we also need to pass the index of the url we are processing into the hook function.
For this purpose we use functools.partial, which allows creating simplified functions by fixing some of the parameters to specific values. This is exactly what we need: looking at the print_url_index signature, we just need to fix the value of index, and the rest will fit the requirements for a hook function.
In our call we use partial with the name of the simplified function, print_url_index, providing each url's unique index to it.
The index could be provided in the loop by enumerate; with a larger number of parameters we may work in a more memory-efficient way and use count, which generates an incremented number each time, starting by default from 0.
Let us run it:
$ python ereq.py
WAIT, NOT YET READY: index: 3: status: 301: length: 66, http://python-requests.org/
WAIT, NOT YET READY: index: 4: status: 301: length: 58, http://kennethreitz.com/
WAIT, NOT YET READY: index: 0: status: 301: length: None, http://www.heroku.com/
PROCESS: index: 2: status: 200: length: 7700, http://httpbin.org/
WAIT, NOT YET READY: index: 1: status: 301: length: 64, http://python-tablib.org/
WAIT, NOT YET READY: index: 4: status: 301: length: None, http://kennethreitz.org
WAIT, NOT YET READY: index: 3: status: 302: length: 0, http://docs.python-requests.org
WAIT, NOT YET READY: index: 1: status: 302: length: 0, http://docs.python-tablib.org
PROCESS: index: 3: status: 200: length: None, http://docs.python-requests.org/en/latest/
PROCESS: index: 1: status: 200: length: None, http://docs.python-tablib.org/en/latest/
PROCESS: index: 0: status: 200: length: 12064, https://www.heroku.com/
PROCESS: index: 4: status: 200: length: 10478, http://www.kennethreitz.org/
This shows that:
requests are not processed in the order they were generated
some requests follow redirection, so the hook function is called multiple times
carefully inspecting the url values, we can see that no url from the original list urls is reported by a response; even for index 2 we got an extra / appended. That is why a simple lookup of the response url in the original list of urls would not help us.
When web-scraping we generally have two types of bottlenecks:
IO blocks - whenever we make a request, we need to wait for the server to respond, which can block our entire program.
CPU blocks - when parsing web scraped content, our code might be limited by CPU processing power.
CPU Speed
CPU blocks are an easy fix - we can spawn more processes. Generally, 1 CPU core can efficiently handle 1 process, so if our scraper is running on a machine that has 12 CPU cores we can spawn 12 processes for a 12x speed boost:
from concurrent.futures import ProcessPoolExecutor
def parse(html):
    ...  # CPU intensive parsing

htmls = [...]
with ProcessPoolExecutor() as executor:
    for result in executor.map(parse, htmls):
        print(result)
Python's ProcessPoolExecutor spawns an optimal number of processes (equal to the number of CPU cores) and distributes tasks across them.
IO Speed
For IO blocking we have more options, as our goal is simply to get rid of useless waiting, which can be done through threads, processes and asyncio loops.
If we're making thousands of requests we can't spawn hundreds of processes. Threads are less expensive, but still, there's a better option - asyncio loops.
Asyncio loops can execute tasks in no specific order. In other words, while task A is blocked, task B can take over the program. This is perfect for web scraping, as there's very little overhead computation going on. We can scale to thousands of requests in a single program.
Unfortunately, for asyncio to work, we need to use Python packages that support asyncio. For example, by using httpx and asyncio we can speed up our scraping significantly:
# comparing synchronous `requests`:
import requests
from time import time

_start = time()
for i in range(50):
    requests.get("http://httpbin.org/delay/1")
print(f"finished in: {time() - _start:.2f} seconds")
# finished in: 52.21 seconds

# versus asynchronous `httpx`
import httpx
import asyncio
from time import time

_start = time()

async def main():
    async with httpx.AsyncClient() as client:
        tasks = [client.get("http://httpbin.org/delay/1") for i in range(50)]
        for response_future in asyncio.as_completed(tasks):
            response = await response_future
    print(f"finished in: {time() - _start:.2f} seconds")

asyncio.run(main())
# finished in: 3.55 seconds
Combining Both
With async code we can avoid IO-blocks and with processes we can scale up CPU intensive parsing - a perfect combo to optimize web scraping:
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from time import sleep, time
import httpx
async def scrape(urls):
    """this is our async scraper that scrapes"""
    results = []
    async with httpx.AsyncClient(timeout=httpx.Timeout(30.0)) as client:
        scrape_tasks = [client.get(url) for url in urls]
        for response_f in asyncio.as_completed(scrape_tasks):
            response = await response_f
            # emulate data parsing/calculation
            sleep(0.5)
            ...
            results.append("done")
    return results


def scrape_wrapper(args):
    i, urls = args
    print(f"subprocess {i} started")
    result = asyncio.run(scrape(urls))
    print(f"subprocess {i} ended")
    return result


def multi_process(urls):
    _start = time()

    batches = []
    batch_size = multiprocessing.cpu_count() - 1  # let's keep 1 core for ourselves
    print(f"scraping {len(urls)} urls through {batch_size} processes")
    for i in range(0, len(urls), batch_size):
        batches.append(urls[i : i + batch_size])

    with ProcessPoolExecutor() as executor:
        for result in executor.map(scrape_wrapper, enumerate(batches)):
            print(result)
        print("done")
    print(f"multi-process finished in {time() - _start:.2f}")


def single_process(urls):
    _start = time()
    results = asyncio.run(scrape(urls))
    print(f"single-process finished in {time() - _start:.2f}")


if __name__ == "__main__":
    urls = ["http://httpbin.org/delay/1" for i in range(100)]
    multi_process(urls)
    # multi-process finished in 7.22
    single_process(urls)
    # single-process finished in 51.28
These foundational concepts sound complex, but once you narrow things down to the root of the issue, the fixes are very straightforward and already present in Python!
For more details on this subject see my blog Web Scraping Speed: Processes, Threads and Async

urllib2 urlopen read timeout/block

Recently I have been working on a tiny crawler for downloading images from a url.
I use urlopen() from urllib2 with f.open()/f.write():
Here is the code snippet:
# the list for the images' urls
imglist = re.findall(regImg,pageHtml)
# iterate to download images
for index in xrange(1, len(imglist) + 1):
    img = urllib2.urlopen(imglist[index - 1])
    f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
    print('To Read...')

    # potential timeout, may block for a long time
    # so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
    f.write(img.read())
    f.close()
    print('Image %d is ready !' % index)
In the code above, img.read() can potentially block for a long time, and I would like to retry / re-open the image url when that happens.
I'm also concerned about efficiency: if the number of images to download is somewhat large, using a thread pool to download them seems better.
Any suggestions? Thanks in advance.
p.s. I found that the read() method on the img object may cause blocking, so adding a timeout parameter to urlopen() alone seems useless. But I found that the file object has no timeout version of read(). Any suggestions on this? Thanks very much.
The urllib2.urlopen has a timeout parameter which is used for all blocking operations (connection buildup etc.)
This snippet is taken from one of my projects. I use a thread pool to download multiple files at once. It uses urllib.urlretrieve but the logic is the same. The url_and_path_list is a list of (url, path) tuples, the num_concurrent is the number of threads to be spawned, and the skip_existing skips downloading of files if they already exist in the filesystem.
import Queue
import threading
import urllib
from os.path import exists

def download_urls(url_and_path_list, num_concurrent, skip_existing):
    # prepare the queue
    queue = Queue.Queue()
    for url_and_path in url_and_path_list:
        queue.put(url_and_path)

    # start the requested number of download threads to download the files
    threads = []
    for _ in range(num_concurrent):
        t = DownloadThread(queue, skip_existing)
        t.daemon = True
        t.start()

    queue.join()

class DownloadThread(threading.Thread):
    def __init__(self, queue, skip_existing):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.skip_existing = skip_existing

    def run(self):
        while True:
            # grabs url from queue
            url, path = self.queue.get()

            if self.skip_existing and exists(path):
                # skip if requested
                self.queue.task_done()
                continue

            try:
                urllib.urlretrieve(url, path)
            except IOError:
                print "Error downloading url '%s'." % url

            # signals to queue job is done
            self.queue.task_done()
When you create the connection with urllib2.urlopen(), you can give a timeout parameter.
As described in the doc :
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
With this you will be able to manage a maximum waiting duration and catch the exception raised.
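For example, a minimal sketch of that (assuming Python 2 as in the question, with one retry on timeout):

import socket
import urllib2

def fetch_with_retry(img_url, timeout=10, retries=1):
    # cap blocking operations at `timeout` seconds and retry once on failure
    for attempt in range(retries + 1):
        try:
            return urllib2.urlopen(img_url, timeout=timeout).read()
        except (urllib2.URLError, socket.timeout):
            if attempt == retries:
                raise
            print 'Timed out, retrying %s ...' % img_url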
The way I crawl a huge batch of documents is with a batch processor which crawls and dumps constant-sized chunks.
Suppose you are to crawl a pre-known batch of, say, 100K documents. You can have some logic to generate constant-size chunks of, say, 1000 documents which would be downloaded by a thread pool. Once the whole chunk is crawled, you can bulk-insert it into your database, and then proceed with the next 1000 documents, and so on.
Advantages you get by following this approach:
You get the advantage of the thread pool speeding up your crawl rate.
It's fault-tolerant, in the sense that you can continue from the chunk where it last failed.
You can have chunks generated on the basis of priority, i.e. important documents to crawl first. So in case you are unable to complete the whole batch, the important documents are processed and the less important documents can be picked up later on the next run. A rough sketch of this approach follows the list.
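A rough sketch of the chunking idea, with all names being illustrative rather than taken from an existing codebase:

from multiprocessing.pool import ThreadPool

def crawl_one(doc_url):
    # download and parse a single document (placeholder)
    pass

def bulk_insert(results):
    # write one finished chunk to the database in a single call (placeholder)
    pass

def crawl_in_chunks(doc_urls, chunk_size=1000, num_threads=10):
    pool = ThreadPool(num_threads)
    for start in range(0, len(doc_urls), chunk_size):
        chunk = doc_urls[start:start + chunk_size]
        results = pool.map(crawl_one, chunk)  # thread pool speeds up each chunk
        bulk_insert(results)                  # checkpoint: a failed run can resume here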
An ugly hack that seems to work.
import os, socket, threading, errno
def timeout_http_body_read(response, timeout=60):
    def murha(resp):
        os.close(resp.fileno())
        resp.close()

    # set a timer to yank the carpet underneath the blocking read()
    # by closing the os file descriptor
    t = threading.Timer(timeout, murha, (response,))
    try:
        t.start()
        body = response.read()
        t.cancel()
    except socket.error as se:
        if se.errno == errno.EBADF:  # murha happened
            return (False, None)
        raise
    return (True, body)

Urllib2 & BeautifulSoup : Nice couple but too slow - urllib3 & threads?

I was looking for a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree about which solution is the best.
The problem with my script below is the execution time: so slow!
Step 1: I fetch this page
http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on
Step 2: I parse the page with BeautifulSoup
Step 3: I put the data in an excel doc
Step 4: I do it again, and again, and again for all the countries in my list (big list)
(I am just changing "Afghanistan" in the url to another country)
Here is my code:
ws = wb.add_sheet("BULATS_IA") #We add a new tab in the excel doc
x = 0 # We need x and y for pulling the data into the excel doc
y = 0
Countries_List = ['Afghanistan','Albania','Andorra','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
Longueur = len(Countries_List)
for Countries in Countries_List:
    y = 0
    htmlSource = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % (Countries)).read()  # I am opening the page with the name of the corresponding country in the url
    s = soup(htmlSource)
    tableGood = s.findAll('table')
    try:
        rows = tableGood[3].findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            y = 0
            x = x + 1
            for td in cols:
                hum = td.text
                ws.write(x, y, hum)
                y = y + 1
                wb.save("%s.xls" % name_excel)
    except (IndexError):
        pass
So I know that not everything is perfect, but I'm looking forward to learning new things in Python! The script is very slow because urllib2 is not that fast, and neither is BeautifulSoup. For the soup part I guess I can't really make it better, but for urllib2 I think I can.
EDIT 1 :
Multiprocessing useless with urllib2?
Seems to be interesting in my case.
What do you guys think about this potential solution ?!
# Make sure that the queue is thread-safe!!

def producer(self):
    # Only need one producer, although you could have multiple
    with open('urllist.txt', 'r') as fh:
        for line in fh:
            self.queue.enqueue(line.strip())

def consumer(self):
    # Fire up N of these babies for some speed
    while True:
        url = self.queue.dequeue()
        dh = urllib2.urlopen(url)
        with open('/dev/null', 'w') as fh:  # gotta put it somewhere
            fh.write(dh.read())
EDIT 2: URLLIB3
Can anyone tell me more about that?
Re-use the same socket connection for multiple requests
(HTTPConnectionPool and HTTPSConnectionPool) (with optional
client-side certificate verification).
https://github.com/shazow/urllib3
As I am requesting the same website 122 times for different pages, I guess reusing the same socket connection could be interesting, am I wrong?
Can't it be faster? ...
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.bulats.org')
for Pages in Pages_List:
    r = http.request('GET', 'http://www.bulats.org/agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=%s' % (Pages))
    s = soup(r.data)
Consider using something like workerpool. Referring to the Mass Downloader example, combined with urllib3 it would look something like:
import workerpool
import urllib3
URL_LIST = [] # Fill this from somewhere
NUM_SOCKETS = 3
NUM_WORKERS = 5
# We want a few more workers than sockets so that they have extra
# time to parse things and such.
http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
workers = workerpool.WorkerPool(size=NUM_WORKERS)
class MyJob(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        r = http.request('GET', self.url)
        # ... do parsing stuff here

for url in URL_LIST:
    workers.put(MyJob(url))

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
# (If you don't do this, the script might hang due to a rogue undead thread.)
workers.shutdown()
workers.wait()
You may note from the Mass Downloader examples that there are multiple ways of doing this. I chose this particular example just because it's less magical, but any of the other strategies are valid also.
Disclaimer: I am the author of both, urllib3 and workerpool.
I don't think urllib or BeautifulSoup is slow. I ran your code on my local machine with a modified version (removed the excel stuff). It took around 100ms to open the connection, download the content, parse it, and print it to the console for a country.
10ms is the total amount of time that BeautifulSoup spent parsing the content and printing to the console per country. That is fast enough.
Nor do I believe that using Scrapy or threading is going to solve the problem, because the problem is the expectation that it is going to be fast.
Welcome to the world of HTTP. It is going to be slow sometimes, sometimes it will be very fast. A couple of reasons for slow connections:
because of the server handling your request (it returns 404 sometimes)
DNS resolution,
HTTP handshake,
your ISP's connection stability,
your bandwidth rate,
packet loss rate,
etc.
Don't forget, you are trying to make 121 HTTP requests to a server consecutively, and you don't know what kind of servers they have. They might also ban your IP address because of consecutive calls.
Take a look at the Requests lib. Read their documentation. If you're doing this to learn more Python, don't jump into Scrapy directly.
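As a small illustration of what requests buys you here (not the OP's exact code; the URL and parameters just mirror the ones in the question): a Session keeps the TCP connection to the same host alive between calls, so repeated requests skip part of the handshake overhead listed above:

import requests

session = requests.Session()
base = 'http://www.cambridgeesol.org/institutions/results.php'
for country in ['Afghanistan', 'Albania', 'Andorra']:
    resp = session.get(base, params={'region': country, 'type': '', 'BULATS': 'on'})
    html = resp.text  # hand this to BeautifulSoup as before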
Hey guys,
Some news about the problem! I've found this script, which might be useful! I'm actually testing it and it's promising (6.03 seconds to run the script below).
My idea is to find a way to mix that with urllib3. Indeed, I'm making requests to the same host a lot of times.
The PoolManager will take care of reusing connections for you whenever
you request the same host. This should cover most scenarios without
significant loss of efficiency, but you can always drop down to a
lower level component for more granular control. (urllib3 doc site)
Anyway, it seems to be very interesting, and even if I can't see yet how to mix these two functionalities (urllib3 and the threading script below), I guess it's doable! :-)
Thank you very much for taking the time to give me a hand with that, it smells good!
import Queue
import threading
import urllib2
import time
from bs4 import BeautifulSoup as BeautifulSoup
hosts = ["http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=1", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=2", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=3", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=4", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=5", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=6"]
queue = Queue.Queue()
out_queue = Queue.Queue()
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs host from queue
            host = self.queue.get()

            # grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()

            # place chunk into out queue
            self.out_queue.put(chunk)

            # signals to queue job is done
            self.queue.task_done()


class DatamineThread(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs host from queue
            chunk = self.out_queue.get()

            # parse the chunk
            soup = BeautifulSoup(chunk)
            # print soup.findAll(['table'])
            tableau = soup.find('table')
            rows = tableau.findAll('tr')
            for tr in rows:
                cols = tr.findAll('td')
                for td in cols:
                    texte_bu = td.text
                    texte_bu = texte_bu.encode('utf-8')
                    print texte_bu

            # signals to queue job is done
            self.out_queue.task_done()


start = time.time()

def main():
    # spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    # populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()

    # wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
