I'm writing a spider in Python to crawl a site. Trouble is, I need to examine about 2.5 million pages, so I could really use some help making it optimized for speed.
What I need to do is examine each page for a certain number, and if it is found, record the link to the page. The spider is very simple; it just needs to sort through a lot of pages.
I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.
You could use MapReduce like Google does, either via Hadoop (specifically with Python: 1 and 2), Disco, or Happy.
The traditional line of thought is to write your program in standard Python; if you find it is too slow, profile it and optimize the specific slow spots. You can make these slow spots faster by dropping down to C, using C/C++ extensions, or even ctypes.
If you are spidering just one site, consider using wget -r (an example).
Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.
As you are new to Python, I think the following may be helpful for you :)
If you are writing a regex to search for a certain pattern in the pages, compile your regex wherever you can and reuse the compiled object (a quick sketch follows below).
BeautifulSoup is an HTML/XML parser that may be of some use for your project.
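For example, a quick sketch of the compiled-regex idea, assuming the requests library and a made-up target number:

import re
import requests

TARGET = re.compile(r'\b1234567\b')  # compile once; the number is a placeholder

def page_matches(url):
    # reuse the compiled pattern for every page instead of recompiling it
    return TARGET.search(requests.get(url, timeout=10).text) is not None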
Spidering somebody's site with millions of requests isn't very polite. Can you instead ask the webmaster for an archive of the site? Once you have that, it's a simple matter of text searching.
You waste a lot of time waiting for network requests when spidering, so you'll definitely want to make your requests in parallel. I would probably save the result data to disk and then have a second process looping over the files searching for the term. That phase could easily be distributed across multiple machines if you needed extra performance.
What Adam said. I did this once to map out Xanga's network. The way I made it faster was by having a thread-safe set containing all the usernames I had to look up. Then I had five or so threads making requests at the same time and processing them. You're going to spend way more time waiting for the page to download than you will processing any of the text (most likely), so just find ways to increase the number of requests you can make at the same time.
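To give a rough idea of that approach, here is a minimal sketch using a thread pool; the URL list and the target string are placeholders, not anyone's actual code:

from concurrent.futures import ThreadPoolExecutor
import requests

def check_page(url):
    # download the page and report whether it contains the target number
    html = requests.get(url, timeout=10).text
    return url if '1234567' in html else None

urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder list
with ThreadPoolExecutor(max_workers=20) as pool:
    matches = [u for u in pool.map(check_page, urls) if u]

Most of the wall-clock time is spent blocked on the network, so raising max_workers (within reason) matters far more than micro-optimizing the text search.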
I'm quite new to backtrader, and since I've started I couldn't stop wondering why there's no database support for the data feed. I've found a page on the official website where it's described how to implement a custom data feed. The implementation should be pretty easy, but on GitHub (or more generally on the web) I couldn't find a single implementation of a feed backed by MongoDB. I understand that CSVs are easier to manage and so on, but in some cases they can require a lot of RAM, since all the data is stored in memory at once. On the other hand, a DB can be "RAM friendly" but will take longer during the backtesting process, even if the DB is a document-oriented one. Does anyone have experience with both of these approaches? And if so, is there some code I can take a look at?
Thanks!
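I can't offer benchmarks comparing the two approaches, but the simplest MongoDB-backed feed I know of is to query the collection with pymongo into a pandas DataFrame and hand that to backtrader's built-in PandasData feed. A rough sketch (the database, collection, and field names are assumptions):

import backtrader as bt
import pandas as pd
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
cursor = client['market']['eurusd_1h'].find({}, {'_id': 0})  # assumed db/collection
df = pd.DataFrame(list(cursor))
df['datetime'] = pd.to_datetime(df['datetime'])  # assumed field name
df = df.set_index('datetime').sort_index()

cerebro = bt.Cerebro()
cerebro.adddata(bt.feeds.PandasData(dataname=df))  # expects open/high/low/close/volume columns
cerebro.run()

Note this still pulls the whole result set into RAM before the backtest starts; a truly incremental feed means subclassing backtrader's data feed base class and loading one bar at a time in _load(), as the custom data feed page you mention describes.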
First of all - many thanks in advance. I really appreciate it all.
So I need to crawl a small number of URLs fairly constantly (around every hour) and get specific data from them.
A PHP site will be updated with the crawled data; I cannot change that.
I've read this solution: Best solution to host a crawler? which seems to be fine and has the upside of using cloud services if you want something to be scaled up.
I'm also aware of the existence of Scrapy
Now, I wonder if there is a more complete solution to this matter without me having to set all these things up. It seems to me that the problem I'm trying to solve is not a very unusual one, and I'd like to save time and have a more complete solution or some instructions.
I would contact the person in this thread to get more specific help, but I can't. (https://stackoverflow.com/users/2335675/marcus-lind)
I'm currently running Windows on my personal machine, and trying to mess with Scrapy is not the easiest thing, with installation problems and stuff like that.
Do you think there is no way avoiding this specific work?
In case there isn't, how do I know if I should go with Python/Scrapy or Ruby On Rails, for example?
If the data you're trying to get are reasonably well structured, you could use a third party service like Kimono or import.io.
I find setting up a basic crawler in Python to be incredibly easy. After looking at a lot of options, including Scrapy (it didn't play well with my Windows machine either, due to the nightmare dependencies), I settled on using Selenium's Python package driven by PhantomJS for headless browsing.
Defining your crawling function would probably only take a handful of lines of code. This is a little rudimentary, but if you wanted to do it super simply as a straight Python script, you could do something like the following and just let it run while some condition is true or until you kill the script.
from selenium import webdriver
import time

crawler = webdriver.PhantomJS()
crawler.set_window_size(1024, 768)

def crawl():
    crawler.get('http://www.url.com/')
    # Find your elements, get the contents, parse them using Selenium or BeautifulSoup

while True:
    crawl()
    time.sleep(3600)
I want to write a program that searches through a fairly large website and extracts certain things. I've had a couple online Python courses, but neither said anything about how to access the internet with Python. I have no idea where I ought to start with this.
First, read about the standard Python library urllib2.
Once you are comfortable with the basic ideas behind this lib, you can try requests, which makes it much easier to interact with the web, especially APIs. I suggest using it alongside httpie to test out queries quick and dirty from the command line.
If you go a little further and build a library or an engine to crawl the web, you will need some sort of asynchronous programming; I recommend starting with Gevent (a short sketch follows below).
Finally, if you want to create a crawler/bot, you can take a look at Scrapy. You should, however, start with the basic libraries before diving into this one, as it can get quite complex.
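As a rough illustration of the Gevent idea (monkey-patching so that blocking requests calls become cooperative; the URLs are placeholders):

from gevent import monkey
monkey.patch_all()  # must run before other network-using imports

import gevent
import requests

def fetch(url):
    return url, requests.get(url, timeout=10).status_code

urls = ['http://example.com/a', 'http://example.com/b']  # placeholders
jobs = [gevent.spawn(fetch, u) for u in urls]
gevent.joinall(jobs)
print([job.value for job in jobs])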
It sounds like you want a web crawler/scraper. What sorts of things do you want to pull? Images? Links? Either way, this is just the job for one.
Start there; there should be lots of articles on Stack Overflow that will help you implement details such as connecting to the internet (getting a web response).
See this article.
There is much more on the internet than just websites, but I assume that you just want to crawl some HTML pages and extract data from them. You have many, many options to solve that problem. Just some starting points:
urllib2 from the standard library
https://pypi.python.org/pypi/requests (much easier and more user friendly)
http://scrapy.org/ (a very good crawling framework)
http://www.crummy.com/software/BeautifulSoup/ (library to extract data from html)
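A minimal sketch combining the second and fourth options, just to show how little code a first attempt needs (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.com/').text
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'), link.get_text(strip=True))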
For a site I'm working on I would like to import a lot of RSS feeds using Django. Since I need their content fast, I will need to cache them locally (either in the database or in some other way).
Is there a standard app to do RSS consumption in Django, or is there a standard way to do this in Python?
Of course I could implement it myself, but I'd rather reuse a good piece of code (since there's a lot of stuff to consider, like what to do when an item updates, how long to wait before checking for updates, etc., and I'd rather reuse someone else's thinking about this).
(I did google "django" and "rss", but everything that seems to pop up is feed generation; surely there must be other sites out there using Django and consuming RSS?)
Check out http://feedparser.org/docs/ and http://code.google.com/p/feedparser/
It is one of the best Python libraries for parsing RSS and Atom feeds, although it seems like you want to do a bit more (caching, auto-refresh, etc.).
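feedparser won't do the caching or refresh scheduling for you, but it does expose the feed's ETag/Last-Modified values, which you can store (in your Django models, say) and send back on the next poll so unchanged feeds are skipped. A rough sketch with a placeholder feed URL:

import feedparser

feed = feedparser.parse('http://example.com/rss')  # placeholder URL
for entry in feed.entries:
    print(entry.title, entry.link)

# later, poll again with the stored ETag / modified values
again = feedparser.parse('http://example.com/rss',
                         etag=feed.get('etag'),
                         modified=feed.get('modified'))
if getattr(again, 'status', None) == 304:
    pass  # nothing changed, keep the cached copy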
I wrote a python script that reads in text from a file and writes a text file of definitions. I want to somehow integrate my program with a webpage for the whole world to see.
I want to be able to retrieve input text from one text box, have the python script process it, then display the output in the other text box.
I have done quite a bit of research thus far, but I am still unsure of the best way to go about doing this. I tried using Google App Engine but encountered too many problems; for example, the App Engine runtime environment uses Python 2.5.2, and I wrote my program using 3.1.2. Beyond that, I just felt that I was beginning to waste my time trying to port my program over.
I'm starting to think that JavaScript is the way to go, or maybe Pyjamas. I was also wondering if it would be possible to just have the Python program constantly running on the server and perform a system call.
I possess very little knowledge when it comes to web development. I appreciate any advice.
You could use the cgi module and create a CGI script, if your server supports it.
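A minimal sketch of that route, assuming your script is wrapped in a hypothetical define_words() function and the HTML form posts a field named 'text':

#!/usr/bin/env python
import cgi
from mydefinitions import define_words  # hypothetical module wrapping your script

form = cgi.FieldStorage()
text = form.getvalue('text', '')

print('Content-Type: text/html')
print('')
print('<textarea readonly>%s</textarea>' % define_words(text))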
It's a much bigger question, involving:
Where are you going to host the site,
How slow is the script (can it execute in a few seconds or not),
Does it need access data from files or a database,
How complex is it,
etc.
I would suggest you read about Django:
http://docs.djangoproject.com/en/1.2/
That is probably the easiest way to set up a simple web site, but is also very powerful if you want to do something more in the future (related to this project or not).
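To give a feel for how little code the Django route needs, here is a rough sketch of a view that reads the text from a form field called 'text', runs it through a hypothetical define_words() wrapper around your script, and returns the output (URL configuration and the HTML form are omitted):

# views.py
from django.http import HttpResponse
from mydefinitions import define_words  # hypothetical wrapper around your script

def lookup(request):
    text = request.POST.get('text', '')
    return HttpResponse(define_words(text), content_type='text/plain')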
However, since your script is Python 3 only, you don't have too many options; see this question:
https://stackoverflow.com/questions/373945/what-web-development-frameworks-support-python-3
If it's not too hard, I suppose it is worth thinking about converting it to Python 2.7.
If that is an option, then you might very well go down to Python 2.5 and use Google App Engine. It gives you many things for free, and you really don't need to worry about much of what you would have to if you were setting up your own server. It includes a modified (better to say, shrunk-down) version of Django 1.1. When you say you are wasting your time porting from 3.x to 2.5, I guess you were not counting the time that you would waste setting everything else up.