Let's say I have a function:
from time import sleep

def doSomethingThatTakesALongTime(number):
    print number
    sleep(10)
and then I call it in a for loop:

for number in range(10):
    doSomethingThatTakesALongTime(number)
How can I set this up so that it takes only 10 seconds TOTAL to print out:

$ 0123456789

instead of taking 100 seconds? If it helps, I'm going to use the information you provide to do asynchronous web scraping, i.e. I have a list of sites I want to visit, but I want to visit them simultaneously rather than wait for each one to complete.
Try Eventlet; the first example in its documentation shows how to implement simultaneous URL fetching:
urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "https://wiki.secondlife.com/w/images/secondlife.jpg",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

import eventlet
from eventlet.green import urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
    print "got body", len(body)
I can also advise looking at Celery for a more flexible solution.
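If you go the Celery route, a minimal sketch might look like the following; the module name, the Redis broker URL, and the use of requests are my assumptions, not part of the original answer:

# tasks.py -- a hypothetical Celery setup (assumes a Redis broker is running locally)
from celery import Celery
import requests

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def fetch(url):
    # Runs on a Celery worker process; the caller does not wait for it.
    return len(requests.get(url).content)

With a worker running (celery -A tasks worker), the caller just fires fetch.delay(url) for each URL and moves on.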
asyncoro supports asynchronous, concurrent programming. It includes an asynchronous (non-blocking) socket implementation. If your implementation does not need urllib/httplib etc. (which don't have asynchronous completions), it may fit your purpose, and it is easy to use, as it is very similar to programming with threads. Your problem above with asyncoro:
import asyncoro

def do_something(number, coro=None):
    print number
    yield coro.sleep(10)

for number in range(10):
    asyncoro.Coro(do_something, number)
Take a look at the Scrapy framework. It's intended specifically for web scraping and is very good. It is asynchronous and built on the Twisted framework.
http://scrapy.org/
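To give a feel for the shape of a Scrapy spider, here is a minimal sketch; the spider name and URL are placeholders of mine, not from the original answer:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Scrapy schedules the requests concurrently and calls this back
        # for each response.
        self.log('got body %d bytes' % len(response.body))

You would run it with something like scrapy runspider example_spider.py.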
Just in case, this is the exact way to apply green threads to your example snippet:
from eventlet.green.time import sleep
from eventlet.greenpool import GreenPool

def doSomethingThatTakesALongTime(number):
    print number
    sleep(10)

pool = GreenPool()

for number in range(100):
    pool.spawn_n(doSomethingThatTakesALongTime, number)

import timeit
print timeit.timeit("pool.waitall()", "from __main__ import pool")
# yields: 10.9335260363
I am new to Python programming.
I am trying to parse the HTTP responses from Instagram to find a specific word using regular expressions.
I've used multiprocessing, but it's still SLOW. I know my code might look stupid, but that's my best.
What am I doing wrong that makes it slow? I need to make it send multiple HTTP requests faster.
import requests
import re
import time
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool
from multiprocessing import cpu_count

Nthreads = cpu_count() * 2
pool = Pool(Nthreads)

f = open('full.txt', 'r')
fw = open('out.txt', 'w')

def findSnap(bio):
    regex = 'content=".*sn[a]*p[a-z]*\s*[^a-z0-9].*'
    snap = re.findall(regex, bio)
    if not snap:
        return None
    else:
        afterSnap = re.sub('content=".*sn[a]*p[a-z]*\s*[^a-z0-9]*\s*', '', snap[0])
        if afterSnap:
            afterSnap = re.findall('[\w_\.-]*', afterSnap)[0]
            sftS = afterSnap.split()
            if sftS:
                return sftS[0]
            return None
        return None

def loadInfo(url):
    #print 'Loading data..'
    st = time.time()
    try:
        page = requests.get(url).text.lower()
    except Exception as e:
        print('Something is wrong!')
        return None

    snap = findSnap(page)
    if snap:
        fw.write(snap + '\n')
        fw.flush()
        print(snap)
    else:
        return None
    return snap

start = time.time()
names = f.read().splitlines()
baseUrl = 'https://instagram.com/'
urls = map(lambda x: baseUrl + x, names)

pool.map(loadInfo, urls)
finish = time.time()
print((finish - start) / 60)
fw.close()
As some people have said, we may need more details about what times you are getting, what you are expecting, and why you expect that, because there are a lot of factors that can affect the execution time of your application, not just your code: your application depends on a third-party resource.
In any case, I see you are using multiprocessing.dummy, which is just a wrapper around the threading module [1]. According to its documentation, it seems it is not the best module for running regular Python code concurrently [2]:
CPython implementation detail: In CPython, due to the Global
Interpreter Lock, only one thread can execute Python code at once
(even though certain performance-oriented libraries might overcome
this limitation). If you want your application to make better use of
the computational resources of multi-core machines, you are advised to
use multiprocessing or concurrent.futures.ProcessPoolExecutor.
However, threading is still an appropriate model if you want to run
multiple I/O-bound tasks simultaneously.
It is true that in your case you are doing I/O operations, but processing regexes is also a heavy task. As the quoted text says, you can try pool implementations from the multiprocessing module other than dummy, or you can also use concurrent.futures.ProcessPoolExecutor.
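As a rough sketch of that suggestion, one way to split the work is to keep a thread pool for the downloads and hand the regex-heavy step to a process pool; the fetch_page/find_snap names, the placeholder URL list, and the simplified regex are mine, not the question's exact code:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import re
import requests

def fetch_page(url):
    # I/O-bound: threads are fine here, the GIL is released while waiting.
    return requests.get(url).text.lower()

def find_snap(page):
    # CPU-bound regex work: runs in a separate process, so the GIL
    # does not serialize it.
    match = re.search(r'content=".*sn[a]*p[a-z]*', page)
    return match.group(0) if match else None

if __name__ == '__main__':
    urls = ['https://instagram.com/someuser']  # placeholder list
    with ThreadPoolExecutor(max_workers=20) as io_pool, \
            ProcessPoolExecutor() as cpu_pool:
        pages = io_pool.map(fetch_page, urls)
        for result in cpu_pool.map(find_snap, pages):
            if result:
                print(result)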
In Bash, it is possible to execute a command in the background by appending &. How can I do it in Python?
while True:
    data = raw_input('Enter something: ')
    requests.post(url, data=data)     # Don't wait for it to finish.
    print('Sending POST request...')  # This should appear immediately.
Here's a hacky way to do it:
try:
    requests.get("http://127.0.0.1:8000/test/", timeout=0.0000000001)
except requests.exceptions.ReadTimeout:
    pass
Edit: for those of you who observed that this will not await a response, that is my understanding of the question ("fire and forget... do not wait for it to finish"). There are much more thorough and complete ways to do it with threads or async if you need response context, error handling, etc.
I use multiprocessing.dummy.Pool. I create a singleton thread pool at the module level, and then use pool.apply_async(requests.get, [params]) to launch the task.
This command gives me a future, which I can add to a list with other futures indefinitely until I'd like to collect all or some of the results.
multiprocessing.dummy.Pool is, against all logic and reason, a THREAD pool and not a process pool.
Example (works in both Python 2 and 3, as long as requests is installed):
from multiprocessing.dummy import Pool

import requests

pool = Pool(10)  # Creates a pool with ten threads; more threads = more concurrency.
                 # "pool" is a module attribute; you can be sure there will only
                 # be one of them in your application
                 # as modules are cached after initialization.

if __name__ == '__main__':
    futures = []
    for x in range(10):
        futures.append(pool.apply_async(requests.get, ['http://example.com/']))
    # futures is now a list of 10 futures.
    for future in futures:
        print(future.get())  # For each future, wait until the request is
                             # finished and then print the response object.
The requests will be executed concurrently, so running all ten of these requests should take no longer than the longest one. This strategy will only use one CPU core, but that shouldn't be an issue because almost all of the time will be spent waiting for I/O.
Elegant solution from Andrew Gorcester. In addition, without using futures, it is possible to use the callback and error_callback arguments (see the doc) in order to perform asynchronous processing:
from multiprocessing.dummy import Pool

import requests
from requests import Response

pool = Pool(10)

def on_success(r: Response):
    if r.status_code == 200:
        print(f'Post succeeded: {r}')
    else:
        print(f'Post failed: {r}')

def on_error(ex: Exception):
    print(f'Post request failed: {ex}')

pool.apply_async(requests.post, args=['http://server.host'],
                 kwds={'json': {'key': 'value'}},
                 callback=on_success, error_callback=on_error)
According to the requests documentation, you should move to another library:
Blocking Or Non-Blocking?
With the default Transport Adapter in place, Requests does not provide
any kind of non-blocking IO. The Response.content property will block
until the entire response has been downloaded. If you require more
granularity, the streaming features of the library (see Streaming
Requests) allow you to retrieve smaller quantities of the response at
a time. However, these calls will still block.
If you are concerned about the use of blocking IO, there are lots of
projects out there that combine Requests with one of Python’s
asynchronicity frameworks.
Two excellent examples are
grequests and
requests-futures.
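As a quick illustration of the second of those, a minimal requests-futures sketch might look like this (installed with pip install requests-futures; the URL is a placeholder of mine):

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=10)
future = session.get('http://example.com/')  # returns immediately
# ... do other work here ...
response = future.result()                   # block only when the result is needed
print(response.status_code)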
Simplest and Most Pythonic Solution using threading
A simple way to send a POST/GET request, or to execute any other function, without waiting for it to finish is to use the built-in Python module threading.
import threading
import requests

def send_req():
    requests.get("http://127.0.0.1:8000/test/")

for x in range(100):
    threading.Thread(target=send_req).start()  # starts a new thread and continues
Other Important Features of threading (a short sketch follows below)
You can turn these threads into daemons using thread_obj.daemon = True
You can wait for one to finish executing and then continue using thread_obj.join()
You can check whether a thread is alive using thread_obj.is_alive(), which returns True or False
You can even check the active thread count with threading.active_count()
Official Documentation
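For illustration, here is a short sketch exercising the features listed above; it reuses the placeholder URL from the example:

import threading
import requests

def send_req():
    requests.get("http://127.0.0.1:8000/test/")

t = threading.Thread(target=send_req)
t.daemon = True                   # won't keep the interpreter alive on exit
t.start()

print(threading.active_count())   # main thread + the worker
print(t.is_alive())               # True while the request is in flight
t.join()                          # optionally wait for it to finish
print(t.is_alive())               # False once run() has returned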
If you can put the code to be executed in a separate Python program, here is a possible solution based on the subprocess module.
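A minimal sketch of that idea, assuming the long-running work lives in a hypothetical worker.py script:

import subprocess
import sys

# Launch a separate Python script and return immediately, much like
# appending & in Bash. "worker.py" is a hypothetical script name.
proc = subprocess.Popen([sys.executable, 'worker.py', 'http://example.com/'])
print('worker started, carrying on...')  # prints immediately
# proc.wait() would block until the child finishes, if you ever need that.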
Otherwise you may find this question and its answers useful: the trick is to use the threading library to start a separate thread that executes the separated task.
A caveat with both approaches is the number of items (that is to say, the number of threads) you have to manage. If there are too many items in the parent, you may consider halting every batch of items until at least some threads have finished, but I think this kind of management is non-trivial.
For a more sophisticated approach you can use an actor-based library; I have not used one myself, but I think it could help in that case.
from multiprocessing.dummy import Pool

import requests

pool = Pool()

def on_success(r):
    print('Post succeed')

def on_error(ex):
    print('Post requests failed')

def call_api(url, data, headers):
    requests.post(url=url, data=data, headers=headers)

def pool_processing_create(url, data, headers):
    pool.apply_async(call_api, args=[url, data, headers],
                     callback=on_success, error_callback=on_error)
Currently, I have a list of URLs to grab content from, and I am doing it serially. I would like to change it to grab them in parallel. This is pseudocode. I would like to ask: is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import Queue

q = Queue.Queue()

def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done? >

def put_queue(q, url):
    q.put(do_database(url))

for myfiles in currentdir:
    url = myfiles + some_other_string
    t = threading.Thread(target=put_queue, args=(q, url))
    t.daemon = True
    t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue

NUM_THREADS = 5  # whatever

q = Queue.Queue()
END_OF_DATA = object()  # a unique object

class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                # ....
                pass

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
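As a rough sketch of that model (not the original answer's code), using twisted.web.client.Agent with placeholder URLs and a print standing in for the database insert:

from twisted.internet import reactor, defer
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def fetch(url):
    d = agent.request(b'GET', url.encode('ascii'))
    d.addCallback(readBody)      # fires with the response body
    return d

def store(body):
    print(len(body))             # stand-in for insert_data_into_database

@defer.inlineCallbacks
def main(urls):
    # Issue all requests concurrently and react as each one completes.
    try:
        yield defer.gatherResults([fetch(u).addCallback(store) for u in urls])
    finally:
        reactor.stop()

urls = ['http://example.com/', 'http://example.org/']
reactor.callWhenRunning(main, urls)
reactor.run()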
For the DB, you have to commit before your changes become effective. But committing on every insert is not optimal; committing after bulk changes gives much better performance.
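To make the batching idea concrete, here is a small sketch assuming sqlite3; the table name, schema, and sample data are placeholders of mine:

import sqlite3

conn = sqlite3.connect('scraped.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (body TEXT)')
webdata_items = ['page one', 'page two']  # placeholder for whatever your workers produce

buf = []
for webdata in webdata_items:
    buf.append((webdata,))
    if len(buf) >= 20:
        conn.executemany('INSERT INTO pages (body) VALUES (?)', buf)
        conn.commit()                     # one commit per batch, not per row
        buf = []

if buf:                                   # flush the remainder
    conn.executemany('INSERT INTO pages (body) VALUES (?)', buf)
    conn.commit()
conn.close()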
As for parallelism, Python wasn't born for it. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo-implementation, FYI:
import gevent
from gevent.monkey import patch_all
patch_all()  # to use with urllib, etc.
from gevent.queue import Queue

def web_worker(q, url):
    grab_something
    q.put(result)

def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db
            db_commit
            buf = []

def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    for url in urls:
        gevent.spawn(web_worker, q, url)

run(urls)
Plus, since this implementation is totally single-threaded, you can safely manipulate shared data between workers, like the queue, the DB connection, global variables, etc.
I have thousands of websites in a database and I want to search all of the websites for a specific string. What is the fastest way to do it? I think I should get the content of each website first - this would be the way I do it:
import urllib2, re

string = "search string"

source = urllib2.urlopen("http://website1.com").read()

if re.search(string, source):
    print "My search string: " + string
and search for the string. But this is very slow. How can I accelerate it in Python?
I don't think your issue is the program; it is the fact that you are executing an HTTP request for thousands of sites. You could investigate different solutions involving some sort of parallel processing, but regardless of how efficient you make the parsing code, you are going to hit a bottleneck with the requests in your current implementation.
Here is a basic example that uses the Queue and threading modules. I would suggest reading up on the benefits of multiprocessing vs. multiple threads (such as the post mentioned by @JonathanV), but this will hopefully be somewhat helpful in understanding what is happening:
import Queue
import threading
import time
import urllib2

my_sites = [
    'http://news.ycombinator.com',
    'http://news.google.com',
    'http://news.yahoo.com',
    'http://www.cnn.com'
]

# Create a queue for our processing
queue = Queue.Queue()

class MyThread(threading.Thread):
    """Create a thread to make the url call."""

    def __init__(self, queue):
        super(MyThread, self).__init__()
        self.queue = queue

    def run(self):
        while True:
            # Grab a url from our queue and make the call.
            my_site = self.queue.get()
            url = urllib2.urlopen(my_site)

            # Grab a little data to make sure it is working
            print url.read(1024)

            # Send the signal to indicate the task has completed
            self.queue.task_done()

def main():
    # This will create a 'pool' of threads to use in our calls
    for _ in range(4):
        t = MyThread(queue)

        # A daemon thread runs but does not block our main function from exiting
        t.setDaemon(True)

        # Start the thread
        t.start()

    # Now go through our site list and add each url to the queue
    for site in my_sites:
        queue.put(site)

    # join() ensures that we wait until our queue is empty before exiting
    queue.join()

if __name__ == '__main__':
    start = time.time()
    main()
    print 'Total Time: {0}'.format(time.time() - start)
For good resources on threading in particular, see Doug Hellmann's post here, an IBM article here (this has become my general threading setup as evidenced by the above) and the actual docs here.
Try looking into using multiprocessing to run multiple searches at the same time. Multithreading works too, but the shared memory can turn into a curse if not managed properly. Take a look at this discussion to help you see which choice would work for you.
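A minimal sketch of that multiprocessing suggestion, reusing the question's urllib2-style fetching; the URL list, pool size, and function names are placeholders of mine:

from multiprocessing import Pool
import re
import urllib2

SEARCH = re.compile('search string')

def contains_string(url):
    # Runs in a worker process: fetch the page and search it.
    try:
        source = urllib2.urlopen(url).read()
    except Exception:
        return url, False
    return url, bool(SEARCH.search(source))

if __name__ == '__main__':
    urls = ['http://website1.com', 'http://website2.com']  # placeholder list
    pool = Pool(processes=8)
    for url, found in pool.map(contains_string, urls):
        if found:
            print(url)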
I am writing a GUI app in Pyglet that has to display tens to hundreds of thumbnails from the Internet. Right now, I am using urllib.urlretrieve to grab them, but this blocks each time until they are finished, and only grabs one at a time.
I would prefer to download them in parallel and have each one display as soon as it's finished, without blocking the GUI at any point. What is the best way to do this?
I don't know much about threads, but it looks like the threading module might help? Or perhaps there is some easy way I've overlooked.
You'll probably benefit from the threading or multiprocessing modules. You don't actually need to create all those Thread-based classes yourself; there is a simpler method using Pool.map:
from multiprocessing import Pool

def fetch_url(url):
    # Fetch the URL contents and save it anywhere you need and
    # return something meaningful (like filename or error code),
    # if you wish.
    ...

pool = Pool(processes=4)
result = pool.map(fetch_url, image_url_list)
As you suspected, this is a perfect situation for threading. Here is a short guide I found immensely helpful when doing my own first bit of threading in Python.
As you rightly indicated, you could create a number of threads, each of which is responsible for performing urlretrieve operations. This allows the main thread to continue uninterrupted.
Here is a tutorial on threading in python:
http://heather.cs.ucdavis.edu/~matloff/Python/PyThreads.pdf
Here's an example of how to use threading.Thread. Just replace the class name and the run function with your own. Note that threading is great for IO-restricted applications like yours and can really speed them up. Using Python threading strictly for computation in standard Python doesn't help, because only one thread can compute at a time.
import threading, time

class Ping(threading.Thread):
    def __init__(self, multiple):
        threading.Thread.__init__(self)
        self.multiple = multiple

    def run(self):
        # sleeps 3 seconds then prints 'pong' x times
        time.sleep(3)
        printString = 'pong' * self.multiple
        print printString

pingInstance = Ping(3)
pingInstance.start()  # your run function will be called with the start function
print "pingInstance is alive? : %d" % pingInstance.isAlive()  # will return True, or 1
print "Number of threads alive: %d" % threading.activeCount()
# main thread + class instance
time.sleep(3.5)
print "Number of threads alive: %d" % threading.activeCount()
print "pingInstance is alive?: %d" % pingInstance.isAlive()
# isAlive returns False when your thread reaches the end of its run function.
# only main thread now
You have these choices:
Threads: easiest but doesn't scale well
Twisted: medium difficulty, scales well, but shares one CPU due to the GIL and being single-threaded.
Multiprocessing: hardest. Scales well if you know how to write your own event loop.
I recommend just using threads unless you need an industrial scale fetcher.
You either need to use threads, or an asynchronous networking library such as Twisted. I suspect that using threads might be simpler in your particular use case.