Multithreaded crawler in Python

Is it possible to create as many threads as needed to use 100% of the CPU, and is that really efficient? I'm planning to create a crawler in Python and, in order to make the program efficient, I want to create as many threads as possible, where each thread downloads one website. I tried looking for some information online; unfortunately I didn't find much.

You are confusing your terminology, but that is ok. A very high-level overview would help.
Concurrency can consist of IO-bound work (reading and writing from disk, HTTP requests, etc.) and CPU-bound work (running a machine-learning optimization function on a big set of data).
With IO-bound work, which I assume is what you are referring to, your CPU is not actually working very hard; it is mostly waiting around for data to come back.
Contrast that with multiprocessing, where you can use multiple cores of your machine to do more intense CPU-bound work.
That said, multithreading could help you. I would advise using the asyncio and aiohttp modules for Python. These will make sure that while you are waiting for a response to come back, the software can continue with other requests.
I use asyncio, aiohttp and bs4 when I need to do some web scraping.
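As a minimal sketch of that pattern (assuming aiohttp and beautifulsoup4 are installed; the URLs are placeholders):
import asyncio
import aiohttp
from bs4 import BeautifulSoup

URLS = ["https://example.com", "https://example.org"]  # placeholder URLs

async def fetch_and_parse(session, url):
    # While this request is in flight, the event loop runs the other coroutines.
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    return url, [a.get("href") for a in soup.find_all("a")]

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_and_parse(session, u) for u in URLS))
    for url, links in results:
        print(url, len(links), "links")

asyncio.run(main())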

Related

Is this running truly parallel?

I'm using requests and threading in python to do some stuff. My question is: is this code truly multithreaded, and is it safe to use? I'm experiencing some slowdown over time. Note: I'm not using this exact code, but mine does similar things.
import time
import threading
import requests

current_threads = 0
max_threads = 32

def doStuff():
    global current_threads
    r = requests.get('https://google.de')
    current_threads -= 1

while True:
    # Busy-wait until a thread slot frees up.
    while current_threads >= max_threads:
        time.sleep(0.05)
    thread = threading.Thread(target=doStuff)
    thread.start()
    current_threads += 1
There could be a number of reasons for the issue you are facing. I'm not an expert in Python, but I can see a number of potential causes for the slowdown:
Depending on the size of the data you are pulling down, you could be overloading your bandwidth. This is a hard one to prove without seeing the exact code you are using, knowing what your code is doing, and knowing your bandwidth.
Connected to the first point: if your files are taking some time to come down per thread, things may be getting clogged up at:
while current_threads >= max_threads:
    time.sleep(0.05)
You could try reducing the maximum number of threads and see if that helps, though it may not if it's the files that are taking time to download.
The problem may not be with your code or your bandwidth but with the server you are pulling the files from; if that server is overloaded, it may be slowing down your transfers.
Firewalls, IPS, IDS, or policies on the server may be throttling your requests. If you make too many requests too quickly, all from the same IP, the server-side network equipment may mistake this for some sort of DoS attack and throttle your requests in response.
Unfortunately Python, compared to lower-level languages such as C# or C++, is not as good at multithreading. This is due to the GIL (Global Interpreter Lock), which comes into play when you are accessing or manipulating the same data in multiple threads. This is quite a sizeable subject in itself, but if you want to read up on it, have a look at this link:
https://medium.com/practo-engineering/threading-vs-multiprocessing-in-python-7b57f224eadb
Sorry I can't be of any more assistance, but this is as much as I can say on the subject given the provided information.
Sure, you're running multiple threads, and provided they're not accessing or mutating the same resources, you're probably "safe".
Whenever I'm accessing external resources (i.e., using requests), I always recommend asyncio over vanilla threading: it gives you explicit context switching (everywhere you have an await you switch contexts, whereas in vanilla threading the switching between threads is decided by the OS and might not be optimal) and lower overhead (you're only using ONE thread).
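As a hedged sketch of that approach, here is one way to cap the number of in-flight requests with an asyncio.Semaphore instead of the hand-rolled thread counter above (the URL list and the limit of 32 are placeholders):
import asyncio
import aiohttp

MAX_CONCURRENT = 32  # plays the role of max_threads above

async def fetch(session, sem, url):
    async with sem:                      # at most MAX_CONCURRENT requests in flight
        async with session.get(url) as resp:
            return await resp.read()     # each await is a context-switch point

async def main(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

asyncio.run(main(["https://google.de"] * 100))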

Is the Multi interface in curl faster or more efficient than using multiple easy interfaces?

I am making something which involves pycurl. Since pycurl depends on libcurl, I was reading through its documentation and came across the Multi interface, where you can perform several transfers using a single multi object. I was wondering if this is faster or more memory-efficient than having multiple easy interfaces, and what the advantage of this approach is, since the site barely says,
"Enable multiple simultaneous transfers in the same thread without making it complicated for the application."
You are trying to optimize something that doesn't matter at all.
If you want to download 200 URLs as fast as possible, you are going to spend 99.99% of your time waiting for those 200 requests, limited by your network and/or the server(s) you're downloading from. The key to optimizing that is to make the right number of concurrent requests. Anything you can do to cut down the last 0.01% will have no visible effect on your program. (See Amdahl's Law.)
Different sources give different guidelines, but typically it's somewhere between 6-12 requests, with no more than 2-4 to the same server. Since you're pulling them all from Google, I'd suggest starting with 4 concurrent requests, then, if that's not fast enough, tweaking that number until you get the best results.
As for space, the cost of storing 200 pages is going to far outstrip the cost of a few dozen bytes here and there for overhead. Again, what you want to optimize is those 200 pages—by storing them to disk instead of in memory, by parsing them as they come in instead of downloading everything and then parsing everything, etc.
Anyway, instead of looking at what command-line tools you have and trying to find a library that's similar to those, look for libraries directly. pycurl can be useful in some cases, e.g., when you're trying to do something complicated and you already know how to do it with libcurl, but in general, it's going to be a lot easier to use either stdlib modules like urllib or third-party modules designed to be as simple as possible like requests.
The main example for ThreadPoolExecutor in the docs shows how to do exactly what you want to do. (If you're using Python 2.x, you'll have to pip install futures to get the backport for ThreadPoolExecutor, and use urllib2 instead of urllib.request, but otherwise the code will be identical.)
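Roughly paraphrased, that docs example boils down to the following sketch (the URL list is a placeholder, and max_workers=4 echoes the suggestion above):
import concurrent.futures
import urllib.request

URLS = ["https://www.google.com/search?q=python"]  # placeholder for your ~200 URLs

def load_url(url, timeout=10):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_to_url = {executor.submit(load_url, url): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print(f"{url} generated an exception: {exc}")
        else:
            print(f"{url} is {len(data)} bytes")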
Having multiple easy interfaces running concurrently in the same thread means building your own reactor and driving curl at a lower level. That's painful in C, and just as painful in Python, which is why libcurl offers, and recommends, multi.
But that "in the same thread" is key here. You can also create a pool of threads and throw the easy instances into that. In C, that can still be painful; in Python, it's dead simple. In fact, the first example in the docs for using a concurrent.futures.ThreadPoolExecutor does something similar, but actually more complicated than you need here, and it's still just a few lines of code.
If you're comparing multi vs. easy with a manual reactor, the simplicity is the main benefit. In C, you could easily implement a more efficient reactor than the one libcurl uses; in Python, that may or may not be true. But in either language, the performance cost of switching among a handful of network requests is going to be so tiny compared to everything else you're doing—especially waiting for those network requests—that it's unlikely to ever matter.
If you're comparing multi vs. easy with a thread pool, then a reactor can definitely outperform threads (except on platforms where you can tie a thread pool to a proactor, as with Windows I/O completion ports), especially for huge numbers of concurrent connections. Also, each thread needs its own stack, which typically means about 1MB of memory pages allocated (although not all of them used), which can be a serious problem in 32-bit land for huge numbers of connections. That's why very few serious servers use threads for connections. But in a client making a handful of connections, none of this matters; again, the costs incurred by wasting 8 threads vs. using a reactor will be so small compared to the real costs of your program that they won't matter.

Fast internet crawler

I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something to download a web page, extract links and follow them recursively, but without visiting the same URL twice. Basically, I want to avoid looping.
I already wrote a crawler in Python, but it's too slow. I'm not able to saturate a 100 Mbit line with it. Top speed is ~40 URLs/sec, and for some reason it's hard to get better results. It seems like a problem with Python's multithreading/sockets. I also ran into problems with Python's garbage collector, but that was solvable. The CPU isn't the bottleneck, by the way.
So, what should I use to write a crawler that is as fast as possible, and what's the best solution to avoid looping while crawling?
EDIT:
The solution was to combine the multiprocessing and threading modules. Spawn multiple processes with multiple threads per process for best effect. Spawning multiple threads in a single process is not effective, and multiple processes with just one thread each consume too much memory.
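A hedged sketch of that processes-plus-threads layout (the chunking, worker counts, and fetch helper are placeholder choices, not the original author's code):
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

def crawl_chunk(urls):
    # Each process runs its own thread pool for the IO-bound downloads.
    with ThreadPoolExecutor(max_workers=20) as pool:
        return list(pool.map(fetch, urls))

if __name__ == "__main__":
    urls = ["https://example.com"] * 40          # placeholder URL list
    chunks = [urls[i::4] for i in range(4)]      # one chunk per process
    with Pool(processes=4) as procs:
        for chunk_result in procs.map(crawl_chunk, chunks):
            for url, size in chunk_result:
                print(url, size)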
Why not use something already tested for crawling, like Scrapy? I managed to reach almost 100 pages per second on a low-end VPS with limited RAM (about 400 MB), while the network speed was around 6-7 Mb/s (i.e. below 100 Mbps).
Another improvement you can make is to use urllib3 (especially when crawling many pages from a single domain); I did a brief comparison some time ago.
UPDATE:
Scrapy now uses the Requests library, which in turn uses urllib3. That makes Scrapy the absolute go-to tool when it comes to scraping. Recent versions also support deploying projects, so scraping from a VPS is easier than ever.
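If it helps, a minimal spider for this kind of recursive crawl might look like the following sketch (the domain and start URL are placeholders); Scrapy's built-in duplicate filter handles the "don't visit the same URL twice" requirement:
import scrapy

class LinkSpider(scrapy.Spider):
    name = "link_spider"
    allowed_domains = ["example.com"]          # placeholder domain
    start_urls = ["https://example.com/"]      # placeholder start page

    def parse(self, response):
        yield {"url": response.url}
        # Follow every link; duplicate requests are dropped by Scrapy's dupefilter.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
You can run a standalone spider like this with scrapy runspider.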
Around 2 years ago I developed a crawler that could download almost 250 URLs per second. You could follow my approach:
Optimize your use of file pointers. Try to use as few file pointers as possible.
Don't write your data on every request. Try to dump your data after accumulating around 5000 or 10000 URLs (see the sketch after this list).
For robustness you don't need different configurations. Use a log file, and when you want to resume, just read the log file and restart your crawler from where it left off.
Distribute all your web crawler tasks and process them separately:
a. downloader
b. link extractor
c. URLSeen
d. ContentSeen
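A very rough sketch of the URLSeen / batched-write idea (the class name, log file, and 5000-URL batch size are illustrative only, not the original author's code):
class URLSeen:
    """Remembers visited URLs and flushes them to disk in batches."""

    def __init__(self, log_path="crawl.log", batch_size=5000):
        self.seen = set()
        self.pending = []
        self.log_path = log_path
        self.batch_size = batch_size

    def add(self, url):
        if url in self.seen:
            return False              # already crawled, skip it
        self.seen.add(url)
        self.pending.append(url)
        if len(self.pending) >= self.batch_size:
            self.flush()
        return True

    def flush(self):
        # One write per batch instead of one write per URL.
        with open(self.log_path, "a") as log:
            log.write("\n".join(self.pending) + "\n")
        self.pending.clear()

    def resume(self):
        # Re-read the log to resume an interrupted crawl.
        try:
            with open(self.log_path) as log:
                self.seen.update(line.strip() for line in log)
        except FileNotFoundError:
            pass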
I have written a simple multithreaded crawler. It is available on GitHub as Discovering Web Resources, and I've written a related article: Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can change the number of threads used via the NWORKERS class variable. Don't hesitate to ask further questions if you need extra help.
It sounds like you have a design problem more than a language problem. Look into the multiprocessing module to access more sites at the same time rather than threads. Also, consider using some kind of table to store the sites you have already visited (a database, maybe?).
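For the "table of previously visited sites", a small sqlite3 sketch could look like this (the file name and schema are made up for illustration):
import sqlite3

conn = sqlite3.connect("crawler.db")   # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

def mark_seen(url):
    """Return True if the URL is new, False if it was already visited."""
    try:
        with conn:                      # commits on success, rolls back on error
            conn.execute("INSERT INTO seen (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:      # duplicate primary key means already seen
        return False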
It's impossible to tell what your limitations are. Your problem is similar to the C10K problem: read about that first, and don't optimize straight away. Go for the low-hanging fruit: most probably you'll get significant performance improvements by analyzing your application design. Don't start out massively multithreaded or massively multiprocessed.
I'd use Twisted to write the networking part; it can be very fast. In general, I/O on the machine has to be better than average. You either have to write your data to disk or to another machine, and not every notebook supports 10 MByte/s of sustained database writes. Lastly, if you have an asymmetric internet connection, it might simply be that your upstream is saturated. ACK prioritization helps here (OpenBSD example).

Does a multithreaded crawler in Python really speed things up?

I was looking to write a little web crawler in Python. I was starting to investigate writing it as a multithreaded script, with one pool of threads downloading and one pool processing results. Due to the GIL, would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off the socket, then move on to the next thread, let it pick some data off the socket, and so on?
Basically, I'm asking: is doing a multithreaded crawler in Python really going to buy me much performance vs. single-threaded?
thanks!
The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.
On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.
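One hedged way to arrange that split: a thread pool for the network-bound downloads, with the CPU-bound processing kept in a single thread (the URLs, worker count, and parse step are placeholders):
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def download(url):
    # The GIL is released while the thread waits on the socket.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

def process(url, body):
    # CPU-bound work holds the GIL, so keep it in one place.
    return url, len(body)

urls = ["https://example.com"] * 20        # placeholder URL list
with ThreadPoolExecutor(max_workers=16) as pool:
    futures = [pool.submit(download, u) for u in urls]
    for future in as_completed(futures):
        print(process(*future.result()))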
Look at how Scrapy works. It can help you a lot. It doesn't use threads, but it can do multiple "simultaneous" downloads, all in the same thread.
If you think about it, you have only a single network card, so parallel processing can't really help by definition.
What Scrapy does is simply not wait around for the response of one request before sending another, all in a single thread.
When it comes to crawling you might be better off using something event-based such as Twisted that uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each one.
Asynchronous network operations can easily be, and usually are, single-threaded. Network I/O almost always has higher latency than CPU work, because you really have no idea how long a page is going to take to return, and this is where async shines: an async operation is much lighter-weight than a thread.
Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.
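As a rough sketch of that getPage pattern, assuming an older Twisted release where getPage is still available (it has since been deprecated in favor of Agent/treq), with placeholder URLs:
from twisted.internet import defer, reactor
from twisted.web.client import getPage   # available in older Twisted releases

URLS = [b"http://example.com/", b"http://example.org/"]   # placeholder URLs

def on_page(body, url):
    print(url, len(body), "bytes")

def crawl():
    deferreds = []
    for url in URLS:
        d = getPage(url)                  # returns a Deferred, never blocks
        d.addCallback(on_page, url)
        d.addErrback(lambda failure, u=url: print(u, failure.getErrorMessage()))
        deferreds.append(d)
    return defer.DeferredList(deferreds)

crawl().addCallback(lambda _: reactor.stop())
reactor.run()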
Another consideration: if you're scraping a single website and the server places limits on the frequency of requests you can send from your IP address, adding multiple threads may make no difference.
Yes, multithreaded scraping increases the speed of the process significantly. This is not a case where the GIL is an issue: you are wasting a lot of idle CPU time and unused bandwidth waiting for each request to finish. If the web page you are scraping is on your local network (a rare scraping case), then the difference between multithreaded and single-threaded scraping can be smaller.
You can try the benchmark yourself, playing with one to "n" threads. I have written a simple multithreaded crawler, Discovering Web Resources, and a related article on Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can select how many threads to use by changing the NWORKERS class variable in FocusedWebCrawler.

Python - question regarding the concurrent use of `multiprocess`

I want to use Python's multiprocessing to do concurrent processing without using locks (locks, to me, are the opposite of multiprocessing), because I want to build up multiple reports from different resources at the exact same time during a web request (normally it takes about 3 seconds, but with multiprocessing I can do it in 0.5 seconds).
My problem is that if I expose such a feature to the web and get 10 users pulling the same report at the same time, I suddenly have 60 interpreters open at once (which would crash the system). Is this just the common-sense result of using multiprocessing, or is there a trick to get around this potential nightmare?
Thanks
If you're really worried about having too many instances, you could think about protecting the call with a Semaphore object. If I understand what you're doing, then you can use the threading Semaphore object:
from threading import Semaphore

sem = Semaphore(10)

with sem:
    make_multiprocessing_call()
I'm assuming that make_multiprocessing_call() will clean up after itself.
This way, only 10 "extra" instances of Python will ever be opened; if another request comes along, it will just have to wait until the previous ones have completed. Unfortunately this won't be in "queue" order ... or any particular order at all.
Hope that helps
You are barking up the wrong tree if you are trying to use multiprocess to add concurrency to a network app. You are barking up a completely wrong tree if you're creating processes for each request. multiprocess is not what you want (at least as a concurrency model).
There's a good chance you want an asynchronous networking framework like Twisted.
Locks are only ever necessary if you have multiple agents writing to a source. If they are just reading, locks are not needed (and, as you said, they defeat the purpose of multiprocessing).
Are you sure that would crash the system? On a web server using CGI, each request spawns a new process, so it's not unusual to see thousands of simultaneous processes (granted, in Python one should use WSGI and avoid this), and they do not crash the system.
I suggest you test your theory -- it shouldn't be difficult to manufacture 10 simultaneous accesses -- and see if your server really does crash.
