Python: Architecture for URL polling and posting

I have a simple problem. I have to fetch a url (about once a minute), check if there is any new content, and if there is, post it to another url.
I have a working system with a cronjob every minute that basically:
count, post_count = 0, 0
for link in models.Link.objects.filter(enabled=True).select_related():
    # do it in two phases in case there is cross pollination
    # get posts
    twitter_posts, meme_posts = [], []
    if link.direction == "t2m" or link.direction == "both":
        twitter_posts = utils.get_twitter_posts(link)
    if link.direction == "m2t" or link.direction == "both":
        meme_posts = utils.get_meme_posts(link)
    # process them
    if len(twitter_posts) > 0:
        post_count += views.twitter_link(link, twitter_posts)
    if len(meme_posts) > 0:
        post_count += views.meme_link(link, meme_posts)
    count += 1
msg = "%s links crawled and %s posts updated" % (count, post_count)
This works great for the 150 users I have now, but the synchronous nature of it scares me. I have url timeouts built in, but at some point my cronjob will take more than a minute, and I'll be left with a pile of them running and overwriting each other.
So, how should I rewrite it?
Some issues:
I don't want to hit the APIs too hard in case they block me. So I'd like to have at most 5 open connections to any API at any time.
Users keep registering in the system as this runs, so I need some way to add them
I'd like this to scale as well as possible
I'd like to reuse as much existing code as I can
So, some thoughts I've had:
Spawn a thread for each link
Use python-twisted - Keep one running process that the cronjob just makes sure is running.
Use stackless - Don't really know much about it.
Ask StackOverflow :)
How would you do this?

Simplest: use a long-running process with sched (on its own thread) to handle the scheduling -- by posting requests to a Queue; have a fixed-size pool of threads (you can find a pre-made thread pool here, but it's easy to tweak it or roll your own) taking requests from the Queue (and returning results via a separate Queue). Registration and other system functions can be handled by a few more dedicated threads, if need be.
Threads aren't so bad, as long as (a) you never have to worry about synchronization among them (just have them communicate by intrinsically thread-safe Queue instances, never sharing access to any structure or subsystem that isn't strictly read-only), and (b) you never have too many (use a few dedicated threads for specialized functions, including scheduling, and a small thread-pool for general work -- never spawn a thread per request or anything like that, that will explode).
Twisted can be more scalable (at low hardware costs), but if you hinge your architecture on threading (and Queues) you have a built-in way to grow the system (by purchasing more hardware) to use the very similar multiprocessing module instead... almost a drop-in replacement, and a potential scaling up of orders of magnitude!-)
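For concreteness, here's a minimal sketch of that shape in Python 3 syntax; get_enabled_links() and process_link() are hypothetical stand-ins for the existing ORM query and per-link logic:

import queue
import sched
import threading
import time

task_queue = queue.Queue()

def worker():
    # fixed pool of workers; they share nothing and communicate only
    # via the intrinsically thread-safe Queue
    while True:
        link = task_queue.get()
        try:
            process_link(link)            # hypothetical: your existing per-link logic
        finally:
            task_queue.task_done()

def scheduler_loop(interval=60):
    # sched handles the timing; re-running the query each pass picks up
    # users who registered since the last cycle
    s = sched.scheduler(time.time, time.sleep)
    def enqueue_all():
        for link in get_enabled_links():  # hypothetical: the Link query above
            task_queue.put(link)
        s.enter(interval, 1, enqueue_all)
    s.enter(0, 1, enqueue_all)
    s.run()

for _ in range(5):                        # a pool of 5 caps open API connections at 5
    threading.Thread(target=worker, daemon=True).start()
scheduler_loop()                          # blocks forever, rescheduling itself

Note this keeps one long-running process, so the cronjob's only job becomes making sure it is still alive.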

Related

Is this running truly parallel?

I'm using requests and threading in python to do some stuff. My question is: Is this code running truly multithreaded and is it safe to use? I'm experiencing some slow down over time. Note: I'm not using this exact code but mine is doing similar things.
import threading
import time

import requests

current_threads = 0
max_threads = 32

def doStuff():
    global current_threads
    r = requests.get('https://google.de')
    current_threads -= 1

while True:
    while current_threads >= max_threads:
        time.sleep(0.05)
    thread = threading.Thread(target=doStuff)
    thread.start()
    current_threads += 1
There could be a number of reasons for the issue you are facing. I'm not an expert in Python, but I can see several potential causes for the slowdown:
Depending on the size of the data you are pulling down, you could be overloading your bandwidth. That's a hard one to prove without seeing the exact code you are using, what it is doing, and knowing your bandwidth.
Kinda connected to the first one, but if your files are taking some time to come down per thread, it may be getting clogged up at:
while current_threads >= max_threads:
    time.sleep(0.05)
You could try reducing the max number of threads and see if that helps, though it may not if it's the files that are taking time to download.
The problem may not be with your code or your bandwidth, but with the server you are pulling the files from; if that server is overloaded, it may be slowing down your transfers.
Firewalls, IPS, IDS, or policies on the server may be throttling your requests. If you make too many requests too quickly, all from the same IP, the server-side network equipment may mistake this for some sort of DoS attack and throttle your requests in response.
Unfortunately Python, compared to lower-level languages such as C# or C++, is not as good at multithreading. This is due to something called the GIL (Global Interpreter Lock), which comes into play when you are accessing/manipulating the same data in multiple threads. This is quite a sizeable subject in itself, but if you want to read up on it, have a look at this link:
https://medium.com/practo-engineering/threading-vs-multiprocessing-in-python-7b57f224eadb
Sorry I can't be of any more assistance, but this is as much as I can say on the subject given the provided information.
Sure, you're running multiple threads, and provided they're not accessing/mutating the same resources, you're probably "safe".
Whenever I'm accessing external resources (i.e., using requests), I always recommend asyncio over vanilla threading, as it allows custom context switching (everywhere you have an "await" you switch contexts, whereas in vanilla threading, switching between threads is determined by the OS and might not be optimal) and reduced overhead (you're only using ONE thread).
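For illustration, here is a sketch of the same fetch loop on asyncio; aiohttp is assumed as the HTTP client, since requests would block the single event-loop thread:

import asyncio

import aiohttp  # assumption: third-party client; requests would block the loop

MAX_CONCURRENT = 32

async def fetch(session, sem, url):
    async with sem:                       # cap in-flight requests, like max_threads
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, 'https://google.de') for _ in range(100)]
        return await asyncio.gather(*tasks)

asyncio.run(main())

Every await is an explicit context-switch point, and the semaphore replaces the hand-rolled current_threads counter (which, incidentally, is itself a shared resource mutated from multiple threads).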

Best way of parallelising this webcrawling loop?

I am making a webcrawler, and I have some "sleep" functions that make the crawl quite long.
For now I am doing :
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            deal_with(driver, year, quarter, speciality, ok)
The deal_with function opens several webpages and waits a few seconds for the complete html download before moving on. The execution time is then very long: there are 24 * 20 * 2 = 960 iterations, each taking no less than a minute.
I would like to use my 4 physical cores (8 threads) to take advantage of parallelism.
I read about tornado, multiprocessing, joblib... and can't really make up my mind on an easy solution to adapt to my code.
Any insight welcome :-)
tl;dr Investing in any choice without fully understanding the bottlenecks you are facing will not help you.
At the end of the day, there are only two fundamental approaches to scaling out a task like this:
Multiprocessing
You launch a number of Python processes, and distribute tasks to each of them. This is the approach you think will help you right now.
Some sample code for how this works, though you could use any appropriate wrapper:
import multiprocessing

# general rule of thumb: launch twice as many processes as cores
process_pool = multiprocessing.Pool(8)  # launches 8 processes

# generate a list of all inputs you wish to feed to this pool
inputs = []
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            inputs.append((driver, year, quarter, speciality, ok))

# feed your list of inputs to your process_pool and print it when done
# (starmap unpacks each tuple into deal_with's arguments)
print(process_pool.starmap(deal_with, inputs))
If this is all you wanted, you can stop reading now.
Asynchronous Execution
Here, you are content with a single thread or process, but you don't want it to be sitting idle waiting for stuff like network reads or disk seeks to come back - you want it to go on and do other, more important things while it's waiting.
True native asynchronous I/O support is provided in Python 3 and does not exist in Python 2.7 outside of the Twisted networking library.
import concurrent.futures

# generate a list of all inputs you wish to feed to this pool
inputs = []
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            inputs.append((driver, year, quarter, speciality, ok))

# produce a pool of processes, and make sure they don't block each other
# - get back an object representing something yet to be resolved, that will
#   only be updated when data comes in.
with concurrent.futures.ProcessPoolExecutor() as executor:
    outputs = [executor.submit(deal_with, *input_tuple) for input_tuple in inputs]
    # wait for all of them to finish - not ideal, since it defeats the purpose
    # in production, but sufficient for an example
    for future_object in concurrent.futures.as_completed(outputs):
        pass  # do something with future_object.result()
So What's the Difference?
My main point here is to emphasise that choosing from a list of technologies isn't as hard as figuring out where the real bottleneck is.
In the examples above, there isn't any difference. Both follow a simple pattern:
Have a lot of workers
Allow these workers to pick something from a queue of tasks right away
When one is free, set them to work on the next one right away.
Thus, you gain no conceptual difference if you follow these examples verbatim, even though they use entirely different technologies and claim to use entirely different techniques.
Any technology you pick will be for naught if you write it in this pattern - even though you'll get some speedup, you will be sorely disappointed if you expected a massive performance boost.
Why is this pattern bad? Because it doesn't solve your problem.
Your problem is simple: you have to wait. While your process is waiting for something to come back, it can't do anything else! It can't call more pages for you. It can't process an incoming task. All it can do is wait.
Having more processes that ultimately wait is not the true solution. An army of troops that has to march to Waterloo will not be faster if you split it into regiments - each regiment eventually has to sleep, though they may sleep at different times and for different lengths, and what will happen is that all of them will arrive at almost the same time.
What you need is an army that never sleeps.
So What Should You Do?
Abstract all I/O bound tasks into something non-blocking. This is your true bottleneck. If you're waiting for a network response, don't let the poor process just sit there - give it something to do.
Your task is made somewhat difficult in that, by default, reading from a socket is blocking. That's the way operating systems are. Thankfully, you don't need Python 3 to solve it (though that is always the preferred solution) - the asyncore library (though Twisted is superior in every way) already exists in Python 2.7 to make network reads and writes happen truly in the background.
There is one and only one case where true multiprocessing needs to be used in Python, and that's if you are doing CPU-bound or CPU-intensive work. From your description, it doesn't sound like that's the case.
In short, you should edit your deal_with function to avoid the blocking wait. Make that wait happen in the background, if needed, using a suitable abstraction from Twisted or asyncore. But don't let it consume your process completely.
If you're using Python 3, I would check out the asyncio module. I believe you can just decorate deal_with with @asyncio.coroutine. You will likely have to adjust what deal_with does to properly work with the event loop as well.
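A rough sketch of that adaptation in modern async/await syntax (which superseded the @asyncio.coroutine decorator); the asyncio.sleep here stands in for deal_with's page-load waits, and real fetching would also need a non-blocking HTTP client:

import asyncio

async def deal_with(year, quarter, speciality):
    # hypothetical async rework: waiting no longer blocks the whole process
    await asyncio.sleep(2)                # stands in for "wait for the page to load"
    # ... fetch and parse here, with a non-blocking client ...

async def main():
    tasks = [deal_with(year, quarter, speciality)
             for speciality in range(1, 25)
             for year in range(1997, 2017)
             for quarter in (1, 2)]
    await asyncio.gather(*tasks)

asyncio.run(main())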

How to make concurrent web service calls from a Django view? [duplicate]

Possible Duplicate:
Asynchronous HTTP calls in Python
I have a Django view which needs to retrieve search results from multiple web services, blend the results together, and render them. I've never done any multithreading in Django before. What is a modern, efficient, safe way of doing this?
I don't know anything about it yet, but gevent seems like a reasonable option. Should I use that? Does it play well with Django? Should I look elsewhere?
Not sure about gevent. The simplest way is to use threads[*]. Here's a simple example of how to use threads in Python:
# std lib modules. "Batteries included" FTW.
import threading
import time

thread_result = -1

def ThreadWork():
    global thread_result
    thread_result = 1 + 1
    time.sleep(5)  # phew, I'm tired after all that addition!

my_thread = threading.Thread(target=ThreadWork)
my_thread.start()  # This will call ThreadWork in the background.
                   # In the meantime, you can do other stuff
y = 2 * 5  # Completely independent calculation.
my_thread.join()  # Wait for the thread to finish doing its thing.
                  # This should take about 5 seconds,
                  # due to time.sleep being called
print("thread_result * y =", thread_result * y)
You can start multiple threads, have each make different web service calls, and join on all of those threads. Once all those join calls have returned, the results are in, and you'll be able to blend them.
more advanced tips: You should call join with a timeout; otherwise, your users might be waiting indefinitely for your app to send them a response. Even better would be for you to make those web service calls before the request arrives at your app; otherwise, the responsiveness of your app is at the mercy of the services that you rely on.
caveat about threading in general: Be careful with data that can be accessed by two (or more) different threads. Access to the same data needs to be "synchronized". The most popular synchronization device is a lock, but there is a plethora of others. threading.Lock implements a lock. If you're not careful about synchronization, you're likely to write a "race condition" into your app. Such bugs are notoriously difficult to debug, because they cannot be reliably reproduced.
In my simple example, thread_result was shared between my_thread and the main thread. I didn't need any locks, because the main thread did not access thread_result until my_thread terminated. If I hadn't called my_thread.join, the result would sometimes be -10 instead of 20. Go ahead and try it yourself.
[*] Python doesn't have true threading in the sense that concurrent threads do not execute simultaneously, even if you have idle cores. However, you still get concurrent execution; when one thread is blocked, other threads can execute.
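To make both tips concrete, here is a small sketch; call_service, fetch, and the services mapping are hypothetical:

import threading

results = {}
results_lock = threading.Lock()           # guards the dict shared between threads

def call_service(name, url):
    data = fetch(url)                     # hypothetical: one blocking web-service call
    with results_lock:
        results[name] = data

threads = [threading.Thread(target=call_service, args=(name, url))
           for name, url in services.items()]  # hypothetical {name: url} mapping
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)                     # bounded wait, so users aren't stuck forever
# blend whatever results arrived in time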
I just nicely solved this problem using futures, available in 3.2 and backported to earlier versions including 2.x.
In my case I was retrieving results from an internal service and collating them:
def _getInfo(request, key):
    return urllib2.urlopen(
        'http://{0[SERVER_NAME]}:{0[SERVER_PORT]}'.format(request.META) +
        reverse('my.internal.view', args=(key,)),
        timeout=30)

…

with futures.ThreadPoolExecutor(max_workers=os.sysconf('SC_NPROCESSORS_ONLN')) as executor:
    futureCalls = dict([
        (key, executor.submit(_getInfo, request, key))
        for key in myListOfItems])

for key in myListOfItems:
    curInfo = futureCalls[key]
    if curInfo.exception() is not None:
        pass  # "exception calling for info: {0}".format(curInfo.exception())
    else:
        pass  # handle the result…
gevent will not help you process the task faster. It is just more efficient than threads when it comes to resource footprint. When running gevent with Django (usually via gunicorn), your web app will be able to handle more concurrent connections than a normal Django WSGI app.
But: I think this has nothing to do with your problem. What you want to do is handle a huge task in one Django view, which is usually not a good idea. I personally advise against using threads or gevent's greenlets for this in Django. I see the point for standalone Python scripts or daemons or other tools, but not for the web. This mostly results in instability and a bigger resource footprint. Instead I agree with the comments of dokkaebi and Andrew Gorcester. Both comments differ somewhat, though, since it really depends on what your task is about.
If you can split your task into many smaller tasks, you could create multiple views handling these subtasks. These views could return something like JSON and can be consumed via AJAX from your frontend. This way you can build the content of your page as it "comes in" and the user does not need to wait until the whole page is loaded.
If your task is one huge chunk, you are better off with a task queue handler. Celery comes to mind. If Celery is overkill, you can use ZeroMQ. This basically works as Andrew mentioned above: you schedule the task for processing and poll the backend from your frontend page until the task is finished (usually also via AJAX). You could also use something like long polling here.
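As a rough illustration of the task-queue route, a minimal Celery sketch; the broker URL and blend_search_results are assumptions, not code from your project:

# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # assumes a Redis broker

@app.task
def blend_search_results(query):
    # call each web service, blend the results, and store them
    # somewhere the frontend can poll (cache, database, ...)
    ...

The view then merely schedules the work with blend_search_results.delay(query) and returns immediately; the page polls via AJAX until the result shows up.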

Python - question regarding the concurrent use of `multiprocess`

I want to use Python's multiprocessing to do concurrent processing without using locks (locks to me are the opposite of multiprocessing) because I want to build up multiple reports from different resources at the exact same time during a web request (normally takes about 3 seconds but with multiprocessing I can do it in .5 seconds).
My problem is that, if I expose such a feature to the web and get 10 users pulling the same report at the same time, I suddenly have 60 interpreters open at the same time (which would crash the system). Is this just the common sense result of using multiprocessing, or is there a trick to get around this potential nightmare?
Thanks
If you're really worried about having too many instances you could think about protecting the call with a Semaphore object. If I understand what you're doing then you can use the threaded semaphore object:
from threading import Semaphore

sem = Semaphore(10)

with sem:
    make_multiprocessing_call()
I'm assuming that make_multiprocessing_call() will clean up after itself.
This way only 10 "extra" instances of Python will ever be opened; if another request comes along, it will just have to wait until the previous ones have completed. Unfortunately this won't be in "queue" order ... or any order in particular.
Hope that helps
You are barking up the wrong tree if you are trying to use multiprocess to add concurrency to a network app. You are barking up a completely wrong tree if you're creating processes for each request. multiprocess is not what you want (at least as a concurrency model).
There's a good chance you want an asynchronous networking framework like Twisted.
Locks are only ever necessary if you have multiple agents writing to a source. If they are just reading, locks are not needed (and, as you said, defeat the purpose of multiprocessing).
Are you sure that would crash the system? On a web server using CGI, each request spawns a new process, so it's not unusual to see thousands of simultaneous processes (granted, in Python one should use WSGI and avoid this), which do not crash the system.
I suggest you test your theory -- it shouldn't be difficult to manufacture 10 simultaneous accesses -- and see if your server really does crash.

Multiprocessing in Python with more than 2 levels

I want to write a program that spawns processes like this: process -> n processes -> n processes each.
Can the second level spawn processes with multiprocessing? I'm using the multiprocessing module of Python 2.6.
Thanks
@vilalian's answer is correct, but terse. Of course, it's hard to supply more information when your original question was vague.
To expand a little, you'd have your original program spawn its n processes, but they'd be slightly different from the original in that you'd want them (each, if I understand your question) to spawn n more processes. You could accomplish this either by having them run code similar to your original process, but that spawned new sets of programs that performed the task at hand without further processing, or you could use the same code/entry point, just providing different arguments - something like
def main(level):
    if level == 0:
        do_work()
    else:
        for i in range(n):
            spawn_process_that_runs_main(level - 1)
and start it off with level == 2
You can structure your app as a series of process pools communicating via Queues at any nested depth, though it can get hairy pretty quickly (probably due to the required context switching).
It's not Erlang, though, that's for sure.
The docs on multiprocessing are extremely useful.
Here (a little too much to drop in a comment) is some code I use to increase throughput in a program that updates my feeds. I have one process polling for feeds that need to be fetched, which stuffs its results in a queue; a Process Pool of 4 workers picks up those results and fetches the feeds, and its results (if any) are then put in a queue for another Process Pool to parse and push into a queue to shove back into the database. Done sequentially, this process would be really slow, due to some sites taking their own sweet time to respond, so most of the time the process was waiting on data from the internet and would only use one core. Under this process-based model, I'm actually waiting on the database the most, it seems, and my NIC is saturated most of the time, as well as all 4 cores actually doing something. Your mileage may vary.
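The shape described above could be sketched like this; download() and parse() are hypothetical stand-ins for the real fetch and parse steps:

import multiprocessing as mp

def fetcher(in_q, out_q):
    # each worker pulls a feed, fetches it, and passes the body along
    for feed in iter(in_q.get, None):     # None is the shutdown sentinel
        out_q.put(download(feed))         # hypothetical fetch step

def parser(in_q, out_q):
    for raw in iter(in_q.get, None):
        out_q.put(parse(raw))             # hypothetical parse step

if __name__ == '__main__':
    fetch_q, parse_q, db_q = mp.Queue(), mp.Queue(), mp.Queue()
    workers = ([mp.Process(target=fetcher, args=(fetch_q, parse_q)) for _ in range(4)] +
               [mp.Process(target=parser, args=(parse_q, db_q)) for _ in range(2)])
    for p in workers:
        p.start()
    # a polling process fills fetch_q; a database writer drains db_q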
Yes - but you might run into an issue which would require the fix I committed to Python trunk yesterday. See http://bugs.python.org/issue5313
Sure you can. Especially if you are using fork to spawn child processes, they work as perfectly normal processes (like the parent). Thread management is quite different, but you can also use "second level" sub-threading.
Take care not to over-complicate your program; for example, programs with two levels of threads are normally unnecessary.
