I need to make a web crawler to extract information from web pages. I did some research and found that Beautiful Soup was excellent, since I could parse the whole document, create DOM-like objects, iterate over them, extract attributes, etc. (similar to jQuery).
But I'm using Python 3.2 and there is no stable version for it (I think there isn't one at all; I only saw 3.1 on their home page).
So I need some equally good alternatives.
Looks to me like Beautiful Soup 3.2.0 was released almost a year ago. There's also HTMLParser: http://docs.python.org/library/htmlparser.html
I think the latest release is 4.1.1; you can read about it here: BS4 Documentation.
I have been using BS4 with PHP on my website for this purpose for a while now, with great results. I had to switch back to BS v3 because of a PHP/Python incompatibility issue, but that is separate from how well the BS4 script works by itself.
Initially I used the built-in HTML parsing engine, but found it slow. After installing the lxml engine on my web server, I saw a massive increase in speed! No noticeable difference in the actual parsing results, but the speed increased dramatically.
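For reference, switching the parser backend in BS4 is a one-argument change. A minimal sketch, assuming both bs4 and lxml are installed (the HTML string is just an example):

from bs4 import BeautifulSoup

html = "<html><body><p>Hello, world.</p></body></html>"
soup_builtin = BeautifulSoup(html, "html.parser")  # pure-Python built-in parser
soup_lxml = BeautifulSoup(html, "lxml")            # same API, parsing handled by lxml (must be installed)
print(soup_lxml.p.get_text())                      # -> Hello, world.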
I'd give it a go - I recommend it, and I tried a LOT of different options before I settled on Beautiful Soup.
Good luck!
From the lxml homepage:
The latest release works with all CPython versions from 2.4 to 3.2.
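In case it helps, here is a minimal sketch of the kind of jQuery-style iteration and attribute extraction you can do with lxml (my own example, not from the homepage; the URL is a placeholder):

from lxml import html
import urllib.request

page = urllib.request.urlopen("http://example.com").read()   # placeholder URL
tree = html.fromstring(page)

# Walk the document like a DOM and pull out attributes, roughly jQuery-style.
for link in tree.xpath("//a"):
    print(link.get("href"), link.text_content())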
The most direct and best alternative to BeautifulSoup is Mechanize.
Mechanize is your saviour if you need to automate simple web page functionality, such as submitting a form (with information you didn't have beforehand, like CSRF tokens). It's even available in multiple programming languages!
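For example, a form submission with Mechanize might look roughly like this (a sketch only; the URL, form index, and field name are made up):

import mechanize

br = mechanize.Browser()
br.open("http://example.com/login")   # made-up URL
br.select_form(nr=0)                  # pick the first form on the page
br["username"] = "me"                 # made-up field name
response = br.submit()                # hidden fields such as CSRF tokens go along automatically
print(response.read()[:200])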
That said, Sven's answer is right: I love lxml when I just need to extract information from HTML.
I am trying to understand how Beautiful Soup works in Python. I have used Beautiful Soup and lxml in the past, but now I'm trying to implement a script that can read data from a given web page without any third-party libraries. It looks like the xml module doesn't have many options and it throws many errors. Is there any other library with good documentation for reading data from a web page?
I am not using these scripts on any particular websites. I am just trying to read from public pages and news blogs.
Third party libraries exist to make your life easier. Yes, of course you could write a program without them (the authors of the libraries had to). However, why reinvent the wheel?
Your best options are BeautifulSoup and Scrapy. However, if you're having trouble with BeautifulSoup, I wouldn't try Scrapy.
Perhaps you can get by with just the plain text from the website?
from bs4 import BeautifulSoup

html_doc = "<html><body><p>Hello, world.</p></body></html>"  # the page's HTML as a string
soup = BeautifulSoup(html_doc, 'html.parser')
pagetxt = soup.get_text()  # all the text, with tags stripped
Then you can be done with all external libraries and just work with plain text. However, if you need to do something more complicated, HTML is something you really should use a library to manipulate. There is just too much that can go wrong.
I inherited someone else's (dreadful) codebase, and am currently desperately trying to fix things. Today, that means gathering a list of all the dead links in our template/homepage.
I'm currently using ElementTree in Python, trying to parse the site using XPath. Unfortunately, it seems that the HTML is malformed, and ElementTree keeps throwing errors.
Are there more error-tolerant XPath parsers? Is there a way to run ElementTree in a non-strict mode? Are there any other methods, such as preprocessing, that could help with this?
LXML can parse some malformed HTML, implements an extended version of the ElementTree API, and supports XPath:
>>> from lxml import html
>>> t = html.fromstring("""<html><body>Hello! <p> Goodbye.</body></html""")
>>> html.tostring(t.xpath("//body")[0])
'<body>Hello! <p> Goodbye.</p></body>'
My commiserations!
You'd be better off parsing your HTML with BeautifulSoup. As the homepage states:
You didn't write that awful page. You're just trying to get some data
out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround screen
scraping projects.
and more importantly:
Beautiful Soup parses anything you give it, and does the tree
traversal stuff for you. You can tell it "Find all the links", or
"Find all the links of class externalLink", or "Find all the links
whose urls match "foo.com", or "Find the table heading that's got bold
text, then give me that text."
BeautifulSoup deals very well with malformed HTML. You should also definitely look at How do I fix wrongly nested / unclosed HTML tags?, where Tidy was also suggested.
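For the dead-link hunt specifically, a minimal sketch of my own (not from the quoted docs; example.com stands in for your homepage, and you may want real HTTP error handling):

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "http://example.com"                      # stands in for your homepage
html = urllib.request.urlopen(BASE).read()
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    url = urljoin(BASE, a["href"])               # resolve relative links
    try:
        code = urllib.request.urlopen(url).getcode()
    except Exception as exc:                     # HTTPError, URLError, bad scheme, ...
        print("dead:", url, "->", exc)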
This is a bit OT, but since it's the links you are interested in, you could also use an external link checker.
I've used Xenu Link Sleuth for years and it works great. I have a couple of sites that have more than 15,000 internal pages and running Xenu on the LAN with 30 simultaneous threads it takes about 5-8 minutes to check the site. All link types (pages, images, CSS, JS, etc.) are checked and there is a simple-but-useful exclusion mechanism. It runs on XP/7 with whatever authorization MSIE has, so you can check member/non-member views of your site.
Note: Do not run it when logged into an account that has admin privileges or it will dutifully wander backstage and start hitting delete on all your data! (Yes, I did that once -- fortunately I had a backup. :-)
I'm currently working on a project that involves a program to inspect a web page's HTML using Python. My program has to monitor a web page, and when a change is made to the HTML, it will complete a set of actions. My questions are: how do you extract just part of a web page, and how do you monitor a web page's HTML and report almost instantly when a change is made? Thanks.
In the past I wrote my own parsers. Nowadays HTML is HTML5: more statements, more JavaScript, and a lot of crappiness produced by developers and their editors, like
document.write('<SCR' + 'IPT
And some web frameworks / developers with bad coding habits change the Last-Modified HTTP header on every request, even if the text a human reads on the page hasn't changed.
I suggest BeautifulSoup for the parsing; you'll have to carefully choose on your own what to watch in order to decide whether the web page has really changed (one way to do that is sketched after the quote below).
Its intro:
BeautifulSoup is a Python package that parses broken HTML, just like
lxml supports it based on the parser of libxml2. BeautifulSoup uses a
different parsing approach. It is not a real HTML parser but uses
regular expressions to dive through tag soup. It is therefore more
forgiving in some cases and less good in others. It is not uncommon
that lxml/libxml2 parses and fixes broken HTML better, but
BeautifulSoup has superior support for encoding detection. It very
much depends on the input which parser works better.
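Putting that together, here is a minimal sketch of one way to watch part of a page (my own assumption of an approach, not from the quoted intro; the URL, element id, and polling interval are all placeholders):

import hashlib
import time
import urllib.request
from bs4 import BeautifulSoup

URL = "http://example.com/news"             # hypothetical page to watch
WATCH_ID = "content"                        # hypothetical id of the element you care about
last_digest = None

while True:
    page = urllib.request.urlopen(URL).read()
    soup = BeautifulSoup(page, "html.parser")
    section = soup.find(id=WATCH_ID)
    text = section.get_text() if section else ""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if last_digest is not None and digest != last_digest:
        print("watched section changed")    # run your set of actions here
    last_digest = digest
    time.sleep(60)                          # poll once a minute

Hashing only the extracted text of the element you care about avoids false alarms from headers like Last-Modified or from markup churn elsewhere on the page.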
Scrapy might be a good place to start. http://doc.scrapy.org/en/latest/intro/overview.html
Getting sections of websites is easy; it's just XML/HTML, and you can use Scrapy or BeautifulSoup.
I am new to the world of data scraping; previously I used Python for web and desktop app development.
I am just wondering if there is any way to get the URLs from a page and then look into each one for specific information like phone numbers, addresses, etc.
Currently I am using BeautifulSoup and have built a method where I pass the URLs in as a parameter.
The site I am scraping is large, and it's really tough to pass the specific URL for each page.
Any suggestion to make it faster and self driven?
Thanks in advance.
You can use Scrapy. It simplifies both crawling and parsing (it uses libxml2 for parsing by default).
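A minimal Scrapy spider sketch (the URL and CSS selectors are made up; the point is that the spider follows links itself, so you don't feed it every URL by hand):

import scrapy

class ContactSpider(scrapy.Spider):
    name = "contacts"
    start_urls = ["http://example.com"]      # made-up starting page

    def parse(self, response):
        # Pull the fields you care about out of the current page.
        yield {
            "url": response.url,
            "phones": response.css(".phone::text").getall(),        # made-up selector
            "addresses": response.css(".address::text").getall(),   # made-up selector
        }
        # Follow every link found on the page; Scrapy de-duplicates requests itself.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

You can run a standalone file like this with something along the lines of scrapy runspider spider.py -o items.json.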
Use a more efficient HTML parser, like lxml. See here for performance comparisons of various Python parsers.
I'm looking for a good html parser like HtmlAgilityPack (open-source .NET project: http://www.codeplex.com/htmlagilitypack), but for using with Python.
Anyone know?
Use Beautiful Soup like everyone does.
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
Beautiful Soup is probably what you're looking for. It is an HTML/XML parser that can deal with invalid pages and allows you, for example, to iterate over specific tags.
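For example, iterating over every link tag looks roughly like this (a small sketch with an inline HTML string standing in for a real page):

from bs4 import BeautifulSoup

html = "<html><body><a href='/a'>A</a><a href='/b'>B</a><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):        # iterate over every <a> tag
    print(a["href"], a.get_text())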