scan websites for content (fast)

I have thousands of websites in a database and I want to search all of them for a specific string. What is the fastest way to do it? I think I should fetch the content of each website first, like this:
import urllib2, re
string = "search string"
source = urllib2.urlopen("http://website1.com").read()
if re.search(string, source):
    print "My search string: " + string
and then search for the string. But this is very slow. How can I speed it up in Python?

I don't think your issue is the program - it is the fact that you are executing an HTTP request for thousands of sites. You could investigate different solutions involving some sort of parallel processing, but regardless of how efficient you make the parsing code you are going to hit a bottleneck with the requests in your current implementation.
Here is a basic example that uses the Queue and threading modules. I would suggest reading up on the benefits of multiprocessing vs. multiple threads (such as the post mentioned by @JonathanV), but this will hopefully be somewhat helpful in understanding what is happening:
import Queue
import threading
import time
import urllib2
my_sites = [
    'http://news.ycombinator.com',
    'http://news.google.com',
    'http://news.yahoo.com',
    'http://www.cnn.com'
]
# Create a queue for our processing
queue = Queue.Queue()
class MyThread(threading.Thread):
    """Create a thread to make the url call."""
    def __init__(self, queue):
        super(MyThread, self).__init__()
        self.queue = queue
    def run(self):
        while True:
            # Grab a url from our queue and make the call.
            my_site = self.queue.get()
            try:
                url = urllib2.urlopen(my_site)
                # Grab a little data to make sure it is working
                print url.read(1024)
            except urllib2.URLError as e:
                # A failed request must not kill the worker thread
                print 'Failed to fetch {0}: {1}'.format(my_site, e)
            finally:
                # Send the signal to indicate the task has completed,
                # even on failure, so queue.join() cannot hang
                self.queue.task_done()
def main():
    # This will create a 'pool' of threads to use in our calls
    for _ in range(4):
        t = MyThread(queue)
        # A daemon thread runs but does not block our main function from exiting
        t.setDaemon(True)
        # Start the thread
        t.start()
    # Now go through our site list and add each url to the queue
    for site in my_sites:
        queue.put(site)
    # join() ensures that we wait until our queue is empty before exiting
    queue.join()
if __name__ == '__main__':
    start = time.time()
    main()
    print 'Total Time: {0}'.format(time.time() - start)
For good resources on threading in particular, see Doug Hellmann's post here, an IBM article here (this has become my general threading setup as evidenced by the above) and the actual docs here.
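For readers on Python 3, the same idea can be sketched with the standard library's concurrent.futures instead of hand-written thread classes. This is a minimal sketch, not the answer's own code: the helper names fetch_text and find_matches, the worker count, and the choice to treat any failed download as "no match" are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download one page and return its body as text ('' if anything fails)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except Exception:
        # Assumption: a site that cannot be fetched simply counts as "no match".
        return ""


def find_matches(urls, needle, fetch=fetch_text, max_workers=8):
    """Return the URLs (in input order) whose body contains `needle`."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() runs fetch() in the pool's threads and keeps the input order.
        bodies = executor.map(fetch, urls)
        return [url for url, body in zip(urls, bodies) if needle in body]
```

Calling find_matches(list_of_urls, "search string") would return the matching URLs. The fetch parameter exists only so that the function can be tried with a stand-in (for example a dict's get method) without touching the network.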

Try looking into using multiprocessing to run multiple searches at the same time. Multithreading works too, but the shared memory can turn into a curse if not managed properly. Take a look at this discussion to help you see which choice would work for you.

Related

Python Requests: Don't wait for request to finish

In Bash, it is possible to execute a command in the background by appending &. How can I do it in Python?
while True:
    data = raw_input('Enter something: ')
    requests.post(url, data=data) # Don't wait for it to finish.
    print('Sending POST request...') # This should appear immediately.

Here's a hacky way to do it:
try:
    requests.get("http://127.0.0.1:8000/test/",timeout=0.0000000001)
except requests.exceptions.ReadTimeout:
    pass
Edit: for those of you that observed that this will not await a response - that is my understanding of the question "fire and forget... do not wait for it to finish". There are much more thorough and complete ways to do it with threads or async if you need response context, error handling, etc.

I use multiprocessing.dummy.Pool. I create a singleton thread pool at the module level, and then use pool.apply_async(requests.get, [params]) to launch the task.
This command gives me a future, which I can add to a list with other futures indefinitely until I'd like to collect all or some of the results.
multiprocessing.dummy.Pool is, against all logic and reason, a THREAD pool and not a process pool.
Example (works in both Python 2 and 3, as long as requests is installed):
from multiprocessing.dummy import Pool
import requests
pool = Pool(10) # Creates a pool with ten threads; more threads = more concurrency.
                # "pool" is a module attribute; you can be sure there will only
                # be one of them in your application
                # as modules are cached after initialization.
if __name__ == '__main__':
    futures = []
    for x in range(10):
        futures.append(pool.apply_async(requests.get, ['http://example.com/']))
    # futures is now a list of 10 futures.
    for future in futures:
        print(future.get()) # For each future, wait until the request is
                            # finished and then print the response object.
The requests will be executed concurrently, so running all ten of these requests should take no longer than the longest one. This strategy will only use one CPU core, but that shouldn't be an issue because almost all of the time will be spent waiting for I/O.
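The timing claim can be checked without any network access. The sketch below (Python 3) replaces requests.get with a hypothetical slow_task that just sleeps; the function names and delays are assumptions chosen for the demonstration.

```python
import time
from multiprocessing.dummy import Pool  # a thread pool, despite the module name


def slow_task(seconds):
    """Stand-in for a blocking request: sleep, then report the delay."""
    time.sleep(seconds)
    return seconds


def run_concurrently(delays):
    """Run slow_task for every delay in a thread pool; return (results, elapsed)."""
    start = time.perf_counter()
    with Pool(len(delays)) as pool:
        async_results = [pool.apply_async(slow_task, [d]) for d in delays]
        results = [r.get() for r in async_results]  # wait for each in turn
    return results, time.perf_counter() - start
```

With four delays of 0.2 seconds each, the elapsed time should be close to 0.2 seconds rather than 0.8, because the threads spend their time waiting, not computing.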

Elegant solution from Andrew Gorcester. In addition, without using futures, it is possible to use the callback and error_callback arguments of apply_async (see doc) to process the result asynchronously:
import requests
from requests import Response
def on_success(r: Response):
    if r.status_code == 200:
        print(f'Post succeed: {r}')
    else:
        print(f'Post failed: {r}')
def on_error(ex: Exception):
    print(f'Post requests failed: {ex}')
# Note: the keyword-arguments parameter of apply_async is named kwds, not kwargs
pool.apply_async(requests.post, args=['http://server.host'], kwds={'json': {'key':'value'}},
                 callback=on_success, error_callback=on_error)
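To see which callback fires in which case, here is a small network-free sketch (Python 3, where error_callback is available). The helper run_with_callbacks is hypothetical, and the threading.Event is there only so the demonstration can hand back the outcome; real fire-and-forget code would not wait.

```python
import threading
from multiprocessing.dummy import Pool


def run_with_callbacks(func, arg):
    """Run func(arg) in a pool thread; return ('ok', value) or ('error', exc)."""
    outcome = []
    done = threading.Event()

    def on_success(value):
        outcome.append(("ok", value))
        done.set()

    def on_error(exc):
        outcome.append(("error", exc))
        done.set()

    pool = Pool(1)
    pool.apply_async(func, [arg], callback=on_success, error_callback=on_error)
    done.wait(timeout=5)  # demo only: block until one of the callbacks has run
    pool.close()
    pool.join()
    return outcome[0]
```

A call that returns normally goes to callback; a call that raises goes to error_callback with the exception object, so run_with_callbacks(int, "x") reports a ValueError instead of crashing the caller.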

According to the doc, you should move to another library:
Blocking Or Non-Blocking?
With the default Transport Adapter in place, Requests does not provide
any kind of non-blocking IO. The Response.content property will block
until the entire response has been downloaded. If you require more
granularity, the streaming features of the library (see Streaming
Requests) allow you to retrieve smaller quantities of the response at
a time. However, these calls will still block.
If you are concerned about the use of blocking IO, there are lots of
projects out there that combine Requests with one of Python’s
asynchronicity frameworks.
Two excellent examples are grequests and requests-futures.

Simplest and Most Pythonic Solution using threading
A simple way to send a POST/GET request, or to run any other function, without waiting for it to finish is the built-in threading module.
import threading
import requests
def send_req():
    requests.get("http://127.0.0.1:8000/test/")
for x in range(100):
    threading.Thread(target=send_req).start() # starts a new thread and continues.
Other Important Features of threading
You can turn these threads into daemons using thread_obj.daemon = True
You can go ahead and wait for one to complete executing and then continue using thread_obj.join()
You can check if a thread is alive using thread_obj.is_alive() bool: True/False
You can even check the active thread count as well by threading.active_count()
Official Documentation
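The features listed above can be tried together in a few lines. This is a sketch for Python 3 in which background_job is a hypothetical stand-in that sleeps instead of sending a request.

```python
import threading
import time


def background_job(seconds):
    """Stand-in for a request that takes a while."""
    time.sleep(seconds)


def demo():
    """Return (alive while running, extra thread count, alive after join)."""
    before = threading.active_count()
    worker = threading.Thread(target=background_job, args=(0.2,))
    worker.daemon = True  # a daemon thread does not keep the interpreter alive
    worker.start()
    alive_while_running = worker.is_alive()
    extra_threads = threading.active_count() - before
    worker.join()  # wait for the thread to finish
    alive_after_join = worker.is_alive()
    return alive_while_running, extra_threads, alive_after_join
```

Setting daemon must happen before start(); changing it on a running thread raises RuntimeError.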

If you can write the code to be executed separately as a separate Python program, here is a possible solution based on the subprocess module.
Otherwise you may find this question and the related answer useful: the trick is to use the threading library to start a separate thread that executes the separate task.
A caveat with both approaches could be the number of items (that is, the number of threads) you have to manage. If the parent has too many items, you may consider holding back each batch of items until at least some threads have finished, but I think this kind of management is non-trivial.
For a more sophisticated approach you can use an actor-based approach. I have not used this library myself, but I think it could help in that case.
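One way to hold back new items until some threads have finished is a semaphore that is acquired before each thread is started and released when the thread ends. The sketch below (Python 3) is an assumption about how this could look, not something taken from the linked answers; run_limited and its peak counter are hypothetical names, and the counter exists only to show that the limit holds.

```python
import threading
import time


def run_limited(items, work, max_threads=3):
    """Run work(item) in threads, never more than max_threads at a time.

    Returns the highest number of threads that were running at once.
    """
    slots = threading.BoundedSemaphore(max_threads)
    lock = threading.Lock()
    stats = {"running": 0, "peak": 0}
    threads = []

    def guarded(item):
        try:
            with lock:
                stats["running"] += 1
                stats["peak"] = max(stats["peak"], stats["running"])
            work(item)
        finally:
            with lock:
                stats["running"] -= 1
            slots.release()  # free the slot so the next item can start

    for item in items:
        slots.acquire()  # blocks here while max_threads threads are busy
        thread = threading.Thread(target=guarded, args=(item,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return stats["peak"]
```

For example, run_limited(range(10), lambda _: time.sleep(0.05), max_threads=3) processes ten items while keeping at most three threads alive. A fixed-size pool (such as concurrent.futures.ThreadPoolExecutor) achieves the same bound with less code.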

from multiprocessing.dummy import Pool
import requests
pool = Pool()
def on_success(r):
    print('Post succeed')
def on_error(ex):
    print('Post requests failed')
def call_api(url, data, headers):
    # Return the response so that on_success receives it instead of None
    return requests.post(url=url, data=data, headers=headers)
def pool_processing_create(url, data, headers):
    pool.apply_async(call_api, args=[url, data, headers],
                     callback=on_success, error_callback=on_error)

Assistance with Python multithreading

Currently, I have a list of URLs to grab content from, and I am doing it serially. I would like to change it to grab them in parallel. This is pseudocode. Is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks
import threading
import Queue
q = Queue.Queue()
def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done?>
def put_queue(q, url ):
    q.put( do_database(url) )
for myfiles in currentdir:
    url = myfiles + some_other_string
    t=threading.Thread(target=put_queue,args=(q,url))
    t.daemon=True
    t.start()

It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue
NUM_THREADS = 5 # whatever
q = Queue.Queue()
END_OF_DATA = object() # a unique object
class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                #....
                pass # placeholder so the block is valid; handle the error here
threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)
# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)
# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()
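A Python 3 variant of the same worker pattern, written so it can be run as-is: the page fetcher and the database insert are passed in as parameters (grab_url and insert_row here are placeholders for the asker's own functions), and a lock guards the insert on the assumption that the database connection is not safe to share between threads.

```python
import queue
import threading

END_OF_DATA = object()  # sentinel telling a worker to stop


def process_all(urls, grab_url, insert_row, num_threads=5):
    """Feed urls to a fixed pool of workers; each result goes to insert_row()."""
    q = queue.Queue()
    db_lock = threading.Lock()  # assumption: the database is not thread-safe

    def worker():
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            try:
                webdata = grab_url(url)
                with db_lock:
                    insert_row(url, webdata)
            except Exception as exc:
                # Keep the worker alive when one URL fails.
                print("failed:", url, exc)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)
    for _ in threads:
        q.put(END_OF_DATA)  # one marker per worker
    for t in threads:
        t.join()
```

As a quick check without a real database, a dict can stand in for the table: process_all(["a", "b"], str.upper, rows.__setitem__) fills rows with the "fetched" values.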

You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
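The answer recommends Twisted; as an illustration of the same single-threaded, event-driven model, the sketch below swaps in Python 3's standard asyncio module instead. It is not Twisted code. fake_fetch is a hypothetical stand-in that sleeps rather than performing a real HTTP request, and the point where a database insert would go is marked in a comment.

```python
import asyncio


async def fake_fetch(url, delay):
    """Stand-in for a non-blocking HTTP request (hypothetical helper)."""
    await asyncio.sleep(delay)  # yields to the event loop while "waiting"
    return url, "body of " + url


async def crawl(jobs):
    """Start every request at once; handle each result as soon as it arrives."""
    finished = []
    tasks = [asyncio.create_task(fake_fetch(url, delay)) for url, delay in jobs]
    for next_done in asyncio.as_completed(tasks):
        url, body = await next_done
        finished.append(url)  # this is where a database insert would go
    return finished
```

Running asyncio.run(crawl([("slow", 0.3), ("fast", 0.05)])) handles "fast" first even though it was submitted second, all on one thread.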

For the DB, you have to commit before your changes become effective. But committing after every insert is not optimal; committing after a batch of changes gives much better performance.
For parallelism, Python isn't born for this. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all() # to use with urllib, etc
from gevent.queue import Queue
def web_worker(q, url):
    grab_something
    q.put(result) # gevent queues use put(), not push()
def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db
            db_commit
            buf = []
def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    for url in urls:
        gevent.spawn(web_worker, q, url)
run(urls)
Plus, since this implementation is totally single-threaded, you can safely manipulate shared data between workers, such as the queue, the DB connection, and global variables.
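The batching advice (commit once per group of inserts rather than once per row) can be sketched with the standard sqlite3 module in Python 3. The pages table, its two columns, and the helper names are assumptions made for this example; the asker's real database and schema are not known.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical two-column table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")
    return conn


def save_pages(conn, pages, batch_size=20):
    """Insert (url, body) pairs, committing once per batch; return commit count."""
    commits = 0
    batch = []
    for page in pages:
        batch.append(page)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO pages VALUES (?, ?)", batch)
            conn.commit()
            commits += 1
            batch = []
    if batch:  # flush the final, partial batch
        conn.executemany("INSERT INTO pages VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits
```

Saving 45 rows with batch_size=20 takes three commits instead of 45. Note the final flush: a buffer that is only written when it fills up would otherwise lose the last few rows.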

Understanding Asynchronous/Multiprocessing in Python

Let's say I have a function:
from time import sleep
def doSomethingThatTakesALongTime(number):
    print number
    sleep(10)
and then I call it in a for loop
for number in range(10):
    doSomethingThatTakesALongTime(number)
How can I set this up so that it only takes 10 seconds TOTAL to print out:
$ 0123456789
Instead of taking 100 seconds. If it helps, I'm going to use the information YOU provide to do asynchronous web scraping. i.e. I have a list of sites I want to visit, but I want to visit them simultaneously, rather than wait for each one to complete.

Try to use Eventlet — the first example of documentation shows how to implement simultaneous URL fetching:
urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "https://wiki.secondlife.com/w/images/secondlife.jpg",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]
import eventlet
from eventlet.green import urllib2
def fetch(url):
    return urllib2.urlopen(url).read()
pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
    print "got body", len(body)
I can also advise looking at Celery for a more flexible solution.

asyncoro supports asynchronous, concurrent programming. It includes asynchronous (non-blocking) socket implementation. If your implementation does not need urllib/httplib etc. (that don't have asynchronous completions), it may fit your purpose (and easy to use, as it is very similar to programming with threads). Your above problem with asyncoro:
import asyncoro
def do_something(number, coro=None):
    print number
    yield coro.sleep(10)
for number in range(10):
    asyncoro.Coro(do_something, number)

Take a look at the Scrapy framework. It is designed specifically for web scraping and is very good. It is asynchronous and built on the Twisted framework.
http://scrapy.org/

Just in case, this is the exact way to apply green threads to your example snippet:
from eventlet.green.time import sleep
from eventlet.greenpool import GreenPool
def doSomethingThatTakesALongTime(number):
    print number
    sleep(10)
pool = GreenPool()
for number in range(100):
    pool.spawn_n(doSomethingThatTakesALongTime, number)
import timeit
print timeit.timeit("pool.waitall()", "from __main__ import pool")
# yields : 10.9335260363
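For comparison, the question's example can also be written with nothing but the Python 3 standard library. This sketch uses asyncio rather than the third-party libraries suggested in the answers, and the sleep length is a parameter so it can be tried quickly; both choices are assumptions for illustration.

```python
import asyncio


async def do_something_that_takes_a_long_time(number, seconds):
    """Print the number, then wait without blocking the other tasks."""
    print(number, end="")
    await asyncio.sleep(seconds)
    return number


async def main(count, seconds):
    """Run `count` waits concurrently and return their results in order."""
    jobs = [do_something_that_takes_a_long_time(n, seconds) for n in range(count)]
    return await asyncio.gather(*jobs)
```

asyncio.run(main(10, 10)) prints the ten digits at once and finishes after roughly 10 seconds instead of 100. This only works because asyncio.sleep cooperates with the event loop; a plain time.sleep or a blocking urllib call inside the coroutine would stall every task.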

Throughput differences when using coroutines vs threading

A few days ago I asked a question on SO about helping me design a paradigm for structuring multiple HTTP requests.
Here's the scenario. I would like to have a multi-producer, multi-consumer system. My producers crawl and scrape a few sites and add the links they find to a queue. Since I'll be crawling multiple sites, I would like to have multiple producers/crawlers.
The consumers/workers feed off this queue, make TCP/UDP requests to these links, and save the results to my Django DB. I would also like to have multiple workers, as each queue item is totally independent of the others.
People suggested that I use a coroutine library for this, i.e. Gevent or Eventlet. Having never worked with coroutines, I read that even though the programming paradigm is similar to threaded paradigms, only one thread is actively executing; when blocking calls occur, such as I/O calls, the stacks are switched in memory and another green thread takes over until it encounters some sort of blocking I/O call. Hopefully I got this right? Here's the code from one of my SO posts:
import gevent
from gevent.queue import *
import time
import random
q = JoinableQueue()
workers = []
producers = []
def do_work(wid, value):
    gevent.sleep(random.randint(0,2))
    print 'Task', value, 'done', wid
def worker(wid):
    while True:
        item = q.get()
        try:
            print "Got item %s" % item
            do_work(wid, item)
        finally:
            print "No more items"
            q.task_done()
def producer():
    while True:
        item = random.randint(1, 11)
        if item == 10:
            print "Signal Received"
            return
        else:
            print "Added item %s" % item
            q.put(item)
for i in range(4):
    workers.append(gevent.spawn(worker, random.randint(1, 100000)))
# This doesn't work.
for j in range(2):
    producers.append(gevent.spawn(producer))
# Uncommenting this makes this script work.
# producer()
q.join()
This works well because the sleep calls are blocking calls and when a sleep event occurs, another green thread takes over. This is a lot faster than sequential execution.
As you can see, I don't have any code in my program that purposely yields the execution of one thread to another thread. I fail to see how this fits into scenario above as I would like to have all the threads executing simultaneously.
All works fine, but I feel the throughput that I've achieved using Gevent/Eventlets is higher than the original sequentially running program but drastically lower than what could be achieved using real-threading.
If I were to re-implement my program using threading mechanisms, each of my producers and consumers could simultaneously be working without the need to swap stacks in and out like coroutines.
Should this be re-implemented using threading? Is my design wrong? I've failed to see the real benefits of using coroutines.
Maybe my concepts are little muddy but this is what I've assimilated. Any help or clarification of my paradigm and concepts would be great.
Thanks

As you can see, I don't have any code in my program that purposely
yields the execution of one thread to another thread. I fail to see
how this fits into scenario above as I would like to have all the
threads executing simultaneously.
There is a single OS thread but several greenlets. In your case gevent.sleep() allows workers to execute concurrently. Blocking IO calls such as urllib2.urlopen(url).read() do the same if you use urllib2 patched to work with gevent (by calling gevent.monkey.patch_*()).
See also A Curious Course on Coroutines and Concurrency to understand how a code can work concurrently in a single threaded environment.
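How code can run concurrently on a single thread is easiest to see with plain generators. The toy scheduler below (Python 3) is only an illustration of the idea, not how gevent is implemented: each task runs until it yields, and a loop resumes the tasks in turn. The names task and run_round_robin are made up for this sketch.

```python
from collections import deque


def task(name, steps, log):
    """A cooperative task: do one step, then yield control back."""
    for step in range(steps):
        log.append("%s%d" % (name, step))
        yield  # comparable to reaching a blocking call under gevent


def run_round_robin(tasks):
    """Single-threaded scheduler: resume each task in turn until all finish."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)  # run until the task's next yield
        except StopIteration:
            continue  # task finished; do not reschedule it
        ready.append(current)
```

Running two tasks "a" (two steps) and "b" (three steps) produces the log a0, b0, a1, b1, b2: the work is interleaved although only one thread ever runs. With gevent the yield points are hidden inside patched blocking calls, which is why the question's code contains no explicit yields.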
To compare throughput between gevent, threading, and multiprocessing, you could write code that is compatible with all three approaches:
#!/usr/bin/env python
concurrency_impl = 'gevent' # single process, single thread
##concurrency_impl = 'threading' # single process, multiple threads
##concurrency_impl = 'multiprocessing' # multiple processes
if concurrency_impl == 'gevent':
    import gevent.monkey; gevent.monkey.patch_all()
import logging
import time
import random
from itertools import count, islice
info = logging.info
if concurrency_impl in ['gevent', 'threading']:
    from Queue import Queue as JoinableQueue
    from threading import Thread
if concurrency_impl == 'multiprocessing':
    from multiprocessing import Process as Thread, JoinableQueue
The rest of the script is the same for all concurrency implementations:
def do_work(wid, value):
    time.sleep(random.randint(0,2))
    info("%d Task %s done" % (wid, value))
def worker(wid, q):
    while True:
        item = q.get()
        try:
            info("%d Got item %s" % (wid, item))
            do_work(wid, item)
        finally:
            q.task_done()
            info("%d Done item %s" % (wid, item))
def producer(pid, q):
    for item in iter(lambda: random.randint(1, 11), 10):
        time.sleep(.1) # simulate a green blocking call that yields control
        info("%d Added item %s" % (pid, item))
        q.put(item)
    info("%d Signal Received" % (pid,))
Don't execute code at module level; put it in main():
def main():
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(process)d %(message)s")
    q = JoinableQueue()
    it = count(1)
    producers = [Thread(target=producer, args=(i, q)) for i in islice(it, 2)]
    workers = [Thread(target=worker, args=(i, q)) for i in islice(it, 4)]
    for t in producers+workers:
        t.daemon = True
        t.start()
    for t in producers: t.join() # put items in the queue
    q.join() # wait until it is empty
    # exit main thread (daemon workers die at this point)
if __name__=="__main__":
    main()

gevent is great when you have very many (green) threads. I tested it with thousands and it worked very well. You have to make sure that all libraries you use, both for scraping and for saving to the DB, become green. AFAIK, if they use Python's socket module, gevent's monkey patching ought to work. Extensions written in C (e.g. MySQLdb) would block, however, and you would need to use green equivalents instead.
If you use gevent you could mostly do away with queues and spawn a new (green) thread for every task, with the code for the thread being as simple as db.save(web.get(address)). gevent will take care of switching to another greenlet when some library call in db or web blocks. It will work as long as your tasks fit in memory.

In this case, your problem is not with program speed (i.e choice of gevent or threading), but network IO throughput. That's (should be) the bottleneck that determines how fast the program runs.
Gevent is one nice way to make sure that is the bottleneck, and not your program's architecture.
This is the sort of process you'd want:
import gevent
from gevent.queue import Queue, JoinableQueue
from gevent.monkey import patch_all
patch_all() # Patch urllib2, etc
def worker(work_queue, output_queue):
    for work_unit in work_queue:
        finished = do_work(work_unit)
        output_queue.put(finished)
        work_queue.task_done()
def producer(input_queue, work_queue):
    for url in input_queue:
        url_list = crawl(url)
        for work in url_list:
            work_queue.put(work)
        input_queue.task_done()
def do_work(work):
    gevent.sleep(0) # Actually process link here
    return work
def crawl(url):
    gevent.sleep(0)
    return list(url) # Actually process url here
input = JoinableQueue()
work = JoinableQueue()
output = Queue()
workers = [gevent.spawn(worker, work, output) for i in range(0, 10)]
producers = [gevent.spawn(producer, input, work) for i in range(0, 10)]
list_of_urls = ['foo', 'bar']
for url in list_of_urls:
    input.put(url)
# Wait for input to finish processing
input.join()
print 'finished producing'
# Wait for workers to finish processing work
work.join()
print 'finished working'
# We now have output!
print 'output:'
for message in output:
    print message
# Or if you'd like, you could use the output as it comes!
You don't need to wait for input and work queues to finish, I've just demonstrated that here.

How to do a non-blocking URL fetch in Python

I am writing a GUI app in Pyglet that has to display tens to hundreds of thumbnails from the Internet. Right now, I am using urllib.urlretrieve to grab them, but this blocks each time until they are finished, and only grabs one at a time.
I would prefer to download them in parallel and have each one display as soon as it's finished, without blocking the GUI at any point. What is the best way to do this?
I don't know much about threads, but it looks like the threading module might help? Or perhaps there is some easy way I've overlooked.

You'll probably benefit from the threading or multiprocessing modules. You don't actually need to create all those Thread-based classes yourself; there is a simpler method using Pool.map:
from multiprocessing import Pool
def fetch_url(url):
    # Fetch the URL contents and save it anywhere you need and
    # return something meaningful (like filename or error code),
    # if you wish.
    ...
pool = Pool(processes=4)
result = pool.map(fetch_url, image_url_list)
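Pool.map blocks until every download has finished, which does not quite match the wish to show each thumbnail as soon as it arrives without freezing the GUI. One possible arrangement, sketched here for Python 3 with concurrent.futures, is to submit the downloads to a thread pool and let the GUI loop poll a queue of finished results. The Downloader class and its method names are hypothetical; in Pyglet, poll() could be called from a function scheduled with pyglet.clock.schedule_interval, which is an assumption about how the app is structured.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class Downloader:
    """Runs downloads in worker threads and hands results to a polling GUI loop."""

    def __init__(self, fetch, max_workers=4):
        self.fetch = fetch  # callable: url -> data
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        self.finished = queue.Queue()  # thread-safe hand-off to the GUI thread

    def start(self, urls):
        """Submit every URL and return immediately (never blocks the caller)."""
        for url in urls:
            future = self.executor.submit(self.fetch, url)
            # The callback runs when the download ends, normally in a worker thread.
            future.add_done_callback(
                lambda done, url=url: self.finished.put((url, done)))

    def poll(self):
        """Return [(url, data), ...] for downloads finished since the last call."""
        ready = []
        while True:
            try:
                url, future = self.finished.get_nowait()
            except queue.Empty:
                return ready
            if future.exception() is None:  # failed downloads are skipped
                ready.append((url, future.result()))
```

Only the GUI thread touches the widgets: worker threads just put results on the queue, and poll() never waits.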

As you suspected, this is a perfect situation for threading. Here is a short guide I found immensely helpful when doing my own first bit of threading in python.

As you rightly indicated, you could create a number of threads, each of which is responsible for performing urlretrieve operations. This allows the main thread to continue uninterrupted.
Here is a tutorial on threading in python:
http://heather.cs.ucdavis.edu/~matloff/Python/PyThreads.pdf

Here's an example of how to use threading.Thread. Just replace the class name with your own and the run function with your own. Note that threading is great for I/O-bound applications like yours and can really speed them up. Using Python threading strictly for computation in standard Python doesn't help, because only one thread can compute at a time.
import threading, time
class Ping(threading.Thread):
    def __init__(self, multiple):
        threading.Thread.__init__(self)
        self.multiple = multiple
    def run(self):
        #sleeps 3 seconds then prints 'pong' x times
        time.sleep(3)
        printString = 'pong' * self.multiple
        print printString
pingInstance = Ping(3)
pingInstance.start() #your run function will be called with the start function
print "pingInstance is alive? : %d" % pingInstance.isAlive() #will return True, or 1
print "Number of threads alive: %d" % threading.activeCount()
#main thread + class instance
time.sleep(3.5)
print "Number of threads alive: %d" % threading.activeCount()
print "pingInstance is alive?: %d" % pingInstance.isAlive()
#isAlive returns false when your thread reaches the end of its run function.
#only main thread now

You have these choices:
Threads: easiest but doesn't scale well
Twisted: medium difficulty, scales well but shares CPU due to GIL and being single threaded.
Multiprocessing: hardest. Scales well if you know how to write your own event loop.
I recommend just using threads unless you need an industrial scale fetcher.

You either need to use threads, or an asynchronous networking library such as Twisted. I suspect that using threads might be simpler in your particular use case.
