These days I'm writing a web crawler script, but one of the problems is that my internet connection is very slow.
So I was wondering whether it is possible to make the web crawler multithreaded, using mechanize or urllib or the like.
If anyone has experience with this, sharing it would be much appreciated.
I searched Google but didn't find much useful information.
Thanks in advance
There's a good, simple example on this Stack Overflow thread.
Practical threaded programming with Python is worth reading.
Making multiple requests to many websites at the same time will certainly speed things up, since you don't have to wait for one result to arrive before sending new requests.
However, threading is just one way to do that (and a poor one, I might add). Don't use threading for this; just don't wait for the response before sending another request. There's no need for threads to do that.
A good idea is to use Scrapy. It is a fast, high-level screen-scraping and web-crawling framework, used to crawl websites and extract structured data from their pages. It is written in Python and can make many concurrent connections to fetch data at the same time (without using threads to do so). It is really fast. You can also study it to see how it is implemented.
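To give a flavour of that, here is a minimal sketch of a Scrapy spider; the spider name, allowed domain and start URL are placeholders for illustration, not taken from any real project:

```python
import scrapy

class LinkSpider(scrapy.Spider):
    # Placeholder identifiers -- point these at the site you actually want to crawl.
    name = "link_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Follow every link on the page; Scrapy schedules these requests
        # concurrently in a single thread (it is built on Twisted), so
        # slow responses do not block the rest of the crawl.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

A one-file spider like this can be run with scrapy runspider spider.py, without setting up a full project.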
Related
Is it possible to create enough threads to use 100% of the CPU, and is it really efficient? I'm planning to create a crawler in Python and, in order to make the program efficient, I want to create as many threads as possible, where each thread will be downloading one website. I tried looking up some information online; unfortunately I didn't find much.
You are confusing your terminology, but that is OK. A very high-level overview would help.
Concurrency can consist of I/O-bound work (reading and writing from disk, HTTP requests, etc.) and CPU-bound work (running a machine learning optimization function on a big data set).
With I/O-bound work, which I assume is what you are referring to, your CPU is in fact not working very hard but rather waiting around for data to come back.
Contrast that with multiprocessing, where you can use multiple cores of your machine to do more intense CPU-bound work.
That said, multithreading could help you. I would advise using the asyncio and aiohttp modules for Python. They will make sure that while you are waiting for one response to be returned, the software can continue with other requests.
I use asyncio, aiohttp and bs4 when I need to do some web scraping.
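For what it's worth, a minimal sketch of that combination might look like the following; the URLs and the title extraction are placeholders, not part of any particular project:

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    # While this request is waiting on the network, the event loop
    # is free to make progress on the other requests.
    async with session.get(url) as resp:
        return url, await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, html in pages:
        title = BeautifulSoup(html, "html.parser").title
        print(url, title.string if title else "(no title)")

if __name__ == "__main__":
    # Placeholder URLs -- replace with the pages you actually want to fetch.
    asyncio.run(main(["https://example.com/", "https://example.org/"]))
```

The key point is that the requests overlap: total time is roughly that of the slowest page rather than the sum of all of them.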
Using web.py, I'm building a website in which I display search results from two third-party websites through their public APIs. Unfortunately, it takes each API about 4 seconds to send back a result. If I query the second API only after I have received the answer from the first, this obviously takes about 8 seconds, which is way too long. To bring this down, I want to send the requests to the APIs simultaneously and simply continue as soon as I have received an answer from both APIs.
My problem is now: how to do this?
I've never worked with parallel computing, but I've heard of multiprocessing and threading. I don't really know what the difference or advantages of each are. I also know that for example C++ is able to do parallel computations. It could therefore also be an option to write the part that queries the APIs in C++ (I'm a beginner in C++, but I think I'd manage). Finally, there could of course be options that I am totally overlooking. Maybe web.py has some options to do this, or maybe there are Python modules which are specifically made to do this?
Since only researching and understanding all of these options would take me quite a lot of time, I thought I'd ask you guys here for some tips.
So which one do you think I should go for? And most importantly: why? All tips are welcome!
You want an asynchronous HTTP request library. Examples of this would be gevent, or grequests.
Alternatively, you could use Python's built-in threading module to run synchronous requests in multiple threads.
Either way, no need to go to another language.
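If you take the threading route, a minimal sketch with the standard library could look like this; the two API URLs are placeholders for your real endpoints:

```python
import threading
import urllib.request

# Placeholder endpoints for the two third-party APIs.
API_URLS = [
    "https://api.example.com/search?q=foo",
    "https://api.example.org/search?q=foo",
]

results = {}

def query(url):
    # Each thread blocks on its own request, so the two ~4-second
    # calls overlap instead of running back to back.
    with urllib.request.urlopen(url, timeout=10) as resp:
        results[url] = resp.read()

threads = [threading.Thread(target=query, args=(u,)) for u in API_URLS]
for t in threads:
    t.start()
for t in threads:
    t.join()  # continue only once both answers have arrived

print({url: len(body) for url, body in results.items()})
```

Writing into a shared dict keyed by URL keeps the example short; for anything larger, concurrent.futures.ThreadPoolExecutor is usually tidier.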
Hey guys I have looked at the best ways to do this. I want to make sure I understand correctly.
If I want:
http1
http2
http3
http...
to be sent at exactly the same time, should I set each of these up in a thread and then start the threads? I need to make sure they go out at the exact same time.
I think this can be done in Java, but I'm not familiar with it. Thanks, guys, for any help you can give!
After reading up more on the process, I don't think this was super clear. Will the async processing send these packets at the same time, so that they arrive at the destination at the same time? From reading different articles it seems like async is just that.
I believe that for what I'm looking for, I would need to use a synchronous method like multiprocessing.
Thoughts?
Your question is not totally clear to me, but have you looked at Twisted? It's an event-driven networking engine written in Python. If you're not familiar with event-driven programming, this Linux Journal article is a good introduction. Basically, instead of threads, asynchronous I/O is used with the reactor pattern (which encapsulates an event loop).
Twisted has multiple web clients. You should probably start with the newer one, called Agent (twisted.web.client.Agent), rather than the older getPage.
If you want to understand Twisted, I can recommend Dave Peticolas's Twisted Introduction. It's long but accessible and detailed.
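To give a feel for Agent, here is a minimal sketch that fetches two pages concurrently; the URLs are placeholders, and plain HTTP is used to avoid pulling in the TLS extras:

```python
from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def fetch(url):
    # agent.request returns a Deferred that fires with a Response once
    # the headers arrive; readBody then collects the full body.
    d = agent.request(b"GET", url)
    d.addCallback(readBody)
    d.addCallback(lambda body: print(url.decode(), len(body), "bytes"))
    return d

# Placeholder URLs -- both requests are in flight at the same time.
urls = [b"http://example.com/", b"http://example.org/"]
done = DeferredList([fetch(u) for u in urls])
done.addCallback(lambda _: reactor.stop())
reactor.run()
```

Nothing here blocks: the reactor's event loop drives both requests and fires the callbacks as responses come in.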
I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something to download a web page, extract links and follow them recursively, but without visiting the same URL twice. Basically, I want to avoid looping.
I already wrote a crawler in Python, but it's too slow. I'm not able to saturate a 100 Mbit line with it. The top speed is ~40 URLs/sec, and for some reason it's hard to get better results. It seems like a problem with Python's multithreading/sockets. I also ran into problems with Python's garbage collector, but that was solvable. The CPU isn't the bottleneck, by the way.
So, what should I use to write a crawler that is as fast as possible, and what's the best solution to avoid looping while crawling?
EDIT:
The solution was to combine multiprocessing and threading modules. Spawn multiple processes with multiple threads per process for best effect. Spawning multiple threads in a single process is not effective and multiple processes with just one thread consume too much memory.
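A rough illustration of that combination (this is not the asker's actual code; the worker counts and URLs are placeholders to be tuned):

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

NUM_PROCESSES = 4          # placeholder; often around the core count
THREADS_PER_PROCESS = 20   # placeholder; tune against bandwidth and memory

def fetch(url):
    try:
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())
    except Exception as exc:
        return url, repr(exc)

def worker(urls):
    # Each process runs its own thread pool, so the GIL of one process
    # never stalls the downloads happening in another.
    with ThreadPoolExecutor(max_workers=THREADS_PER_PROCESS) as pool:
        return list(pool.map(fetch, urls))

if __name__ == "__main__":
    # Placeholder URL list; in a real crawler this comes from a URL frontier.
    urls = ["http://example.com/"] * 100
    chunks = [urls[i::NUM_PROCESSES] for i in range(NUM_PROCESSES)]
    with multiprocessing.Pool(NUM_PROCESSES) as procs:
        for chunk in procs.map(worker, chunks):
            print(len(chunk), "pages fetched by one process")
```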
Why not use something already tested for crawling, like Scrapy? I managed to reach almost 100 pages per second on a low-end VPS with limited RAM (about 400 MB), while network speed was around 6-7 Mb/s (i.e. below 100 Mbps).
Another improvement you can make is to use urllib3 (especially when crawling many pages from a single domain). Here's a brief comparison I did some time ago.
UPDATE:
Scrapy now uses the Requests library, which in turn uses urllib3. That makes Scrapy the absolute go-to tool when it comes to scraping. Recent versions also support deploying projects, so scraping from a VPS is easier than ever.
Around two years ago I developed a crawler that could download almost 250 URLs per second. You could follow my approach:
Optimize your use of file pointers; try to keep the number of open file handles to a minimum.
Don't write your data out every time; try to dump it after storing around 5,000 or 10,000 URLs.
For robustness you don't need to use a different configuration.
Use a log file, and when you want to resume, just read the log file and resume your crawler from where it left off.
Distribute all of your web crawler's tasks and process them in stages (a rough sketch follows this list):
a. downloader
b. link extractor
c. URLSeen
d. ContentSeen
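A rough sketch of that kind of staged pipeline, with queues between the stages (the parser choice, worker counts and seed URL are assumptions for illustration, not the original author's code):

```python
import queue
import threading
import time
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup  # assumed HTML parser; any link extractor works

frontier = queue.Queue()       # URLs waiting for the downloader
pages = queue.Queue()          # fetched HTML waiting for the link extractor
seen = set()                   # URLSeen: URLs already queued
seen_lock = threading.Lock()
SEED = "http://example.com/"   # placeholder seed URL

def downloader():
    while True:
        url = frontier.get()
        try:
            with urlopen(url, timeout=10) as resp:
                pages.put((url, resp.read()))
        except Exception:
            pass
        frontier.task_done()

def link_extractor():
    while True:
        base, html = pages.get()
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            url = urljoin(base, a["href"])
            if not url.startswith(SEED):   # stay on one site for the sketch
                continue
            with seen_lock:                # URLSeen check
                if url not in seen:
                    seen.add(url)
                    frontier.put(url)
        pages.task_done()

# A few workers per stage; the counts are placeholders.
for target in [downloader] * 8 + [link_extractor] * 2:
    threading.Thread(target=target, daemon=True).start()

seen.add(SEED)
frontier.put(SEED)

# Simplistic shutdown for a sketch: a real crawler tracks in-flight work
# explicitly instead of polling the queues like this.
while frontier.unfinished_tasks or pages.unfinished_tasks:
    time.sleep(1)
print("crawled", len(seen), "URLs")
```

ContentSeen (deduplicating identical page content, e.g. by hashing the body) would slot in as another stage between the downloader and the link extractor.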
I have written a simple multithreaded crawler. It is available on GitHub as Discovering Web Resources, and I've written a related article: Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can change the number of threads being used in the NWORKERS class variable. Don't hesitate to ask any further questions if you need extra help.
It sounds like you have a design problem more than a language problem. Try looking into the multiprocessing module for accessing more sites at the same time rather than threads. Also, consider using some kind of table to store your previously visited sites (a database, maybe?).
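One possible shape for that visited-sites table, sketched with the standard library's sqlite3 module (the filename and schema are placeholders):

```python
import sqlite3

# Placeholder database file; use ":memory:" for a quick test.
conn = sqlite3.connect("crawler.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def mark_visited(url):
    """Record the URL; return True if it was new, False if already seen."""
    with conn:  # commits (or rolls back) the insert
        cur = conn.execute(
            "INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,)
        )
    return cur.rowcount == 1

print(mark_visited("http://example.com/"))  # True the first time
print(mark_visited("http://example.com/"))  # False on repeat visits
```

The PRIMARY KEY constraint does the deduplication, and the table survives restarts, which an in-memory set does not.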
It's impossible to tell what your limitations are. Your problem is similar to the C10K problem: read up on that first, and don't optimize straight away. Go for the low-hanging fruit: most probably you will get significant performance improvements by analyzing your application design. Don't start out massively multithreaded or massively multiprocessed.
I'd use Twisted to write the networking part; it can be very fast. In general, I/O on the machine has to be better than average. You have to write your data either to disk or to another machine, and not every notebook supports 10 MB/s sustained database writes. Lastly, if you have an asymmetric internet connection, it might simply be that your upstream is saturated. ACK prioritization helps here (OpenBSD example).
I was looking to write a little web crawler in Python. I was starting to investigate writing it as a multithreaded script, with one pool of threads downloading and one pool processing results. Due to the GIL, would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off the socket, then move on to the next thread, let it pick some data off its socket, and so on?
Basically I'm asking: is doing a multithreaded crawler in Python really going to buy me much performance vs. single-threaded?
Thanks!
The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.
On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.
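One way to act on that is to keep the processing pool small while the download pool stays large, for example with concurrent.futures; the pool sizes, URLs and the parse step here are placeholder assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Many download threads: they spend their time blocked on the network,
# where the GIL is released, so a large pool is cheap.
download_pool = ThreadPoolExecutor(max_workers=32)

# Few processing threads: parsing is CPU-bound and serialized by the GIL,
# so extra threads here mostly add contention.
process_pool = ThreadPoolExecutor(max_workers=2)

def download(url):
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

def process(url, body):
    # Placeholder processing step; a real crawler would parse links here.
    return url, len(body)

urls = ["http://example.com/", "http://example.org/"]  # placeholders
futures = [download_pool.submit(download, u) for u in urls]
for fut in futures:
    url, body = fut.result()
    print(process_pool.submit(process, url, body).result())
```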
Look at how scrapy works. It can help you a lot. It doesn't use threads, but can do multiple "simultaneous" downloading, all in the same thread.
If you think about it, you have only a single network card, so parallel processing can't really help by definition.
What scrapy does is just not wait around for the response of one request before sending another. All in a single thread.
When it comes to crawling, you might be better off using something event-based such as Twisted, which uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each request.
Asynchronous network operations can easily be, and usually are, single-threaded. Network I/O almost always has higher latency than CPU work because you really have no idea how long a page is going to take to return, and this is where async shines: an async operation is much lighter weight than a thread.
Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.
Another consideration: if you're scraping a single website and the server places limits on the frequency of requests you can send from your IP address, adding multiple threads may make no difference.
Yes, multithreaded scraping increases the process speed significantly. This is not a case where the GIL is an issue. You are losing a lot of idle CPU time and unused bandwidth waiting for a request to finish. If the web pages you are scraping are on your local network (a rare scraping case), then the difference between multithreaded and single-threaded scraping can be smaller.
You can try the benchmark yourself, playing with one to n threads. I have written a simple multithreaded crawler, Discovering Web Resources, and a related article: Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can select how many threads to use by changing the NWORKERS class variable in FocusedWebCrawler.
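A simple way to run that one-to-n-threads benchmark with only the standard library; the URL list and thread counts are placeholders, and the timings will of course depend on your connection:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Placeholder benchmark set; use pages you are allowed to fetch repeatedly.
URLS = ["http://example.com/"] * 50

def fetch(url):
    with urlopen(url, timeout=10) as resp:
        return len(resp.read())

for nworkers in (1, 2, 4, 8, 16):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        total = sum(pool.map(fetch, URLS))
    elapsed = time.perf_counter() - start
    print(f"{nworkers:2d} threads: {elapsed:6.2f}s for {len(URLS)} pages"
          f" ({total} bytes)")
```

Past a certain thread count the curve flattens out, since the bottleneck becomes bandwidth or the remote server rather than the GIL.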