Python Urllib UrlOpen Read - python

Say I am retrieving a list of URLs from a server using the urllib2 library in Python. I noticed that it takes about 5 seconds to get one page, so it would take a long time to finish all the pages I want to collect.
I am thinking that out of those 5 seconds, most of the time is spent on the server side, so I am wondering whether I could just start using the threading library. With, say, 5 threads in this case, the average time per page could drop dramatically, maybe to 1 or 2 seconds (though it might make the server a bit busy). How can I tune the number of threads so that I get a decent speed without pushing the server too hard?
Thanks!
Updated:
I increased the number of threads one by one and monitored the total time (in minutes) spent scraping 100 URLs. It turned out that the total time dropped dramatically when going from 1 to 2 threads and kept decreasing as I added more threads, but the improvement from each additional thread became less and less noticeable (the total time even bounced back up once too many threads were created).
I know this is only one specific case for the web server I am harvesting from, but I decided to share it to show the power of threading and hope it will be helpful to somebody one day.

There are a few things you can do. If the URLs are on different domains, then you might just fan out the work to threads, each downloading a page from a different domain.
If your URLs all point to the same server and you do not want to stress it, then you can just retrieve the URLs sequentially. If the server is happy with a couple of parallel requests, then you can look into pools of workers. You could start, say, a pool of four workers and add all your URLs to a queue, from which the workers pull new URLs.
Since you tagged the question with "screen-scraping" as well: Scrapy is a dedicated scraping framework, which can work in parallel.
Python 3 comes with a set of new built-in concurrency primitives under concurrent.futures.
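For illustration, here is a minimal sketch of such a four-worker pool using concurrent.futures; the fetch helper, the URL list, and the timeout are placeholders, not something taken from the question.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Placeholder helper: download one page and return its body as bytes.
    return urlopen(url, timeout=10).read()

urls = ['http://example.com/a', 'http://example.com/b']  # placeholder list

# Four workers pull URLs from the executor's internal queue as they free up.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, body in zip(urls, pool.map(fetch, urls)):
        print(url, len(body))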

Here is a caveat. I have encountered a number of servers powered by somewhat "elderly" releases of IIS. They often will not service a request if there is not a one second delay between requests.

Related

What impacts the latency of requests with proxies in a threaded environment (Python)?

I'm doing some web scraping, and once I start running more than 50 threads (one dedicated proxy per thread, with some of the proxies seemingly on the same subnet, although I'm not sure whether that's indicative of anything), the measured response time increases. (I measure it simply by recording the time before the request and printing the difference between the time after the request completes and the start time.)
What might cause that?
To make sure that it's not a Python performance issue or a software design flaw, I split the threads across several program instances and ran multiple copies of the scraper to utilize multiprocessing.
This did not increase the response times.

Is Celery's concurrency count equal to the number of tasks that can execute at once?

I understand that you should essentially set your worker count to the number of cores your node has, and if you go beyond that, you'll probably overwhelm the node.
I have hundreds of web requests (each as its own task) that need to run every minute; currently I route all of these through apply_async().
If I set my concurrency with -c 10, does that mean it can only execute up to 10 of those requests at a time? Or is the concurrency count not necessarily equal to the number of tasks it can execute at once?
It would be a waste of resources and wildly inefficient to handle only 10 requests at a time when most of that time is spent just waiting for a request to finish. When I find articles on mixing asyncio and Celery, people seem to think it's not a great idea. So what would be the solution here? Was Celery the wrong move, or does a concurrency of 10 not necessarily mean only 10 simultaneous tasks?
I was using the wrong approach. I should've gone with a thread-based approach using Celery: https://www.technoarchsoftwares.com/blog/optimization-using-celery-asynchronous-processing-in-django/
Switched to gevent and it's working exactly as I'd imagined now.
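For reference, a minimal sketch of the kind of setup I mean; the app name, broker URL, and check_site task here are invented for illustration, not part of the original code.
# tasks.py -- illustrative only; the broker URL and the task are placeholders.
from urllib.request import urlopen
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def check_site(url):
    # Purely I/O-bound work, so a gevent pool can keep many in flight at once.
    return urlopen(url, timeout=10).getcode()

# Worker started with a gevent pool instead of the default prefork pool:
#   celery -A tasks worker --pool=gevent --concurrency=100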

Improve URL reachable check

I'm currently running a Python script against multiple web servers. The general task is to find broken (external) links within a CMS. The script runs pretty well so far, but since I test around 50 internal projects, each with several hundred sub-pages, it ends up with several thousand external links I have to check.
For that reason I added multi-threading, which improved performance as I had hoped. But here comes the problem: if a page to check contains a list of links to the same server (a bundle of known issues or tasks to do), it will slow down the destination system. I would like to slow down neither my own servers nor servers that are not mine.
Currently I run up to 20 threads and then wait 0.5 s until a "thread slot" is free again. To check whether a URL is broken I use urlopen(request) from urllib2 and log every time it throws an HTTPError. Back to the list of multiple URLs pointing at the same server: because of the multi-threading, my script will "flood" that web server with up to 20 simultaneous requests.
Just so you have an idea of the scale this script runs at and how many URLs have to be checked: using only 20 threads, the current script already takes 45 minutes of running time for only 4 projects. And this is only checking .. Next step will be to check broken URLs for . Running the current script shows peaks of 1000 ms response time in our server monitoring.
Does anyone have an idea how to improve this script in general? Or is there a much better way to check this large number of URLs? Maybe a counter that pauses the threads if there are 10 requests to a single destination?
Thanks for all suggestions
When I was running a crawler, I had all of my URLs prioritized by domain name. Basically, my queue of URLs to crawl was really a queue of domain names, and each domain name had a list of URLs.
When it came time to get the next URL to crawl, a thread would pull a domain name from the queue and crawl the next URL on that domain's list. When done processing that URL, the thread would put the domain on a delay list and remove from the delay list any domains whose delay had expired.
The delay list was a priority queue ordered by expiration time. That way I could give different delay times to each domain. That allowed me to support the crawl-delay extension to robots.txt. Some domains were ok with me hitting their server once per second. Others wanted a one minute delay between requests.
With this setup, I never hit the same domain with multiple threads concurrently, and I never hit them more often than they requested. My default delay was something like 5 seconds. That seems like a lot, but my crawler was looking at millions of domains, so it was never wanting for stuff to crawl. You could probably reduce your default delay.
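A compressed sketch of that delay-list bookkeeping, using a plain heap of (ready_time, domain) pairs; the domain names and delay values are made up for illustration.
import heapq
import time

domain_queue = ['example.com', 'example.org']   # domains that have URLs waiting
delay_list = []                                 # heap of (ready_time, domain)
crawl_delay = {'example.com': 1.0, 'example.org': 60.0}   # per-domain delays
DEFAULT_DELAY = 5.0

def park_domain(domain):
    # After crawling one URL, hold the domain back until its delay expires.
    ready = time.time() + crawl_delay.get(domain, DEFAULT_DELAY)
    heapq.heappush(delay_list, (ready, domain))

def release_expired():
    # Move domains whose delay has expired back into the crawl queue.
    now = time.time()
    while delay_list and delay_list[0][0] <= now:
        _, domain = heapq.heappop(delay_list)
        domain_queue.append(domain)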
If you don't want to queue your URLs by domain name, what you can do is maintain a collection (a hash table, or in Python a set) that holds the domain names currently being crawled. When you dequeue a URL, you check its domain against that set and put the URL back into the queue if the domain is currently in use. Something like:
import threading

domains_in_use = set()
domains_lock = threading.Lock()

def next_url(urlqueue):
    # urlqueue is a Queue.Queue of URL objects exposing a domain_name attribute.
    while True:
        url = urlqueue.get()
        with domains_lock:
            if url.domain_name in domains_in_use:
                urlqueue.put(url)   # put it back at the end of the queue
            else:
                domains_in_use.add(url.domain_name)
                return url
That will work, although it's going to be a big CPU pig if the queue contains a lot of URLs from the same domain. For example if you have 20 threads and only 5 different domains represented in the queue, then on average 15 of your threads will be continually spinning, looking for a URL to crawl.
If you only want the status, make a HEAD request instead of using urlopen's default GET. This will considerably reduce the load on the server. And of course, limit the number of simultaneous requests.
import httplib
from urlparse import urlparse

def is_up(url):
    # A HEAD request returns only the headers, so the server never sends the body.
    _, host, path, _, _, _ = urlparse(url)
    conn = httplib.HTTPConnection(host)
    conn.request('HEAD', path)
    return conn.getresponse().status < 400
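For example, is_up('http://www.example.com/some/page') returns True for any status below 400. Note that a few servers handle HEAD differently from GET, so an occasional false result is possible.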

Max number of Python scripts running simultaneously

I have written a script to do some research on HTTP Archive data. This script needs to make HTTP requests to sites scraped by HTTP Archive in order to classify sites into groups (e.g., Drupal, WordPress, etc). The script is working really well; however, the list of sites that I am handling is 300,000 sites long.
I would like to be able to complete the categorization of sites as fast as possible. I have experimented with running multiple instances of the script at the same time and it is working well with appropriate locks in place to prevent race conditions.
How can I max this out to get all of these operations completed as fast as possible? For instance, I am looking at spinning up a VPS with 8 CPUs and 16 GB RAM. How do I maximize these resources to make sure I'm using every bit of processing power possible? I may consider spinning up something more powerful, but I want to make sure I understand how to get the most out of it so I'm not wasting money.
Thanks!
The multiprocessing module is the best option; it lets you harness the full power of your 8 CPUs:
https://docs.python.org/3.3/library/multiprocessing.html
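A rough sketch of how that could look; classify_site, the sites.txt input file, and the pool and chunk sizes are placeholders to adapt.
from multiprocessing import Pool

def classify_site(url):
    # Hypothetical classifier: fetch the site and return its detected platform.
    return url, 'unknown'

if __name__ == '__main__':
    with open('sites.txt') as f:          # assumed input file, one URL per line
        sites = [line.strip() for line in f if line.strip()]
    # One worker per CPU core; chunksize keeps task-dispatch overhead low.
    with Pool(processes=8) as pool:
        for url, platform in pool.imap_unordered(classify_site, sites, chunksize=50):
            print(url, platform)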

For my app, how many threads would be optimal?

I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured I would make the crawl logic a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit for the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next URL off the queue.
You will probably find your application is bandwidth limited not CPU or I/O limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. Like if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may impact making too many HTTP requests at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really there's no hard-and-fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and Twisted with either a deferred semaphore or a task cooperator if you already have an easy way to feed an arbitrarily long list of URLs in.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a Twisted-based crawler.
It's usually simpler to make multiple concurrent processes. Simply use subprocess to create as many Popens as you feel are necessary to run concurrently.
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
When you run some larger number, you see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
One crawler does its job in 1 minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might take an hour. Let's say 25 crawlers finish the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper to libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel, in about 120 lines of code.
You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing is going on after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
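One simple way to enforce such a cap is a semaphore per host. A minimal sketch follows; the limit of 5 and the helper names are illustrative, and in Python 3 the modules are urllib.parse and urllib.request.
import threading
from urlparse import urlparse    # urllib.parse in Python 3
from urllib2 import urlopen      # urllib.request in Python 3

_lock = threading.Lock()
_per_host = {}

def _host_slot(host, limit=5):
    # One semaphore per host, capping simultaneous requests to `limit`.
    with _lock:
        if host not in _per_host:
            _per_host[host] = threading.BoundedSemaphore(limit)
        return _per_host[host]

def fetch(url):
    host = urlparse(url).netloc
    with _host_slot(host):       # blocks while 5 requests to this host are in flight
        return urlopen(url).read()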
Threading isn't necessary in this case. Your program is I/O bound rather than CPU bound. The networking part would probably be better done using select() on the sockets; this reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I hear it has really good support for asynchronous networking. It would allow you to specify the URLs you wish to download and register a callback for each; when each download completes, the callback is called and the page can be processed. To allow multiple sites to be downloaded without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue, and the "worker" thread would do the actual processing.
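A bare-bones sketch of that producer/consumer split, with process_page standing in for the regex searches and SQLite writes (both names are placeholders).
import threading
from Queue import Queue            # "queue" in Python 3

pages = Queue()

def process_page(url, html):
    # Placeholder for the regex searches and the SQLite writes.
    pass

def worker():
    while True:
        url, html = pages.get()
        process_page(url, html)
        pages.task_done()

t = threading.Thread(target=worker)
t.daemon = True                    # don't keep the program alive for this thread
t.start()

# The download callback only needs to enqueue the result:
def page_downloaded(url, html):
    pages.put((url, html))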
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.
