Processing badly formed HTML files with XPATH - python

I inherited someone elses (dreadful) codebase, and am currently desperately trying to fix things. Today, that means gathering a list of all the dead links in our template/homepage.
I'm currently using ElementTree in Python, trying to parse the site using xpath. Unfortunately, it seems that the html is malformed, and ElementTree keeps throwing errors.
Are there more error friendly xpath parsers? Is there a way to run ElementTree in a non-strict mode? Are there any other methods, such as preprocessing, that can be used to help this process?

LXML can parse some malformed HTML, implements an extended version of the ElementTree API, and supports XPath:
>>> from lxml import html
>>> t = html.fromstring("""<html><body>Hello! <p> Goodbye.</body></html""")
>>> html.tostring(t.xpath("//body")[0])
'<body>Hello! <p> Goodbye.</p></body>'

My commiserations!
You'd be better off parsing your HTML with BeautifulSoup. As the homepage states:
You didn't write that awful page. You're just trying to get some data
out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround screen
scraping projects.
and more importantly:
Beautiful Soup parses anything you give it, and does the tree
traversal stuff for you. You can tell it "Find all the links", or
"Find all the links of class externalLink", or "Find all the links
whose urls match "foo.com", or "Find the table heading that's got bold
text, then give me that text."

BeautifulSoup can very well deal with malformed HTML. You should also definitely look at How do I fix wrongly nested / unclosed HTML tags?. There, also Tidy was suggested.

This is a bit OT, but since it's the links you are interested in, you could also use an external link checker.
I've used Xenu Link Sleuth for years and it works great. I have a couple of sites that have more than 15,000 internal pages and running Xenu on the LAN with 30 simultaneous threads it takes about 5-8 minutes to check the site. All link types (pages, images, CSS, JS, etc.) are checked and there is a simple-but-useful exclusion mechanism. It runs on XP/7 with whatever authorization MSIE has, so you can check member/non-member views of your site.
Note: Do not run it when logged into an account that has admin privileges or it will dutifully wander backstage and start hitting delete on all your data! (Yes, I did that once -- fortunately I had a backup. :-)

Related

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using python and beautiful soup. I encountered that in some sites, the image links although seen on the browser is cannot be seen in the source code. However on using Chrome Inspect or Fiddler, we can see the the corresponding codes.
What I see in the source code is:
<div id="cntnt"></div>
But on Chrome Inspect, I can see a whole bunch of HTML\CSS code generated within this div class. Is there a way to load the generated content also within python? I am using the regular urllib in python and I am able to get the source but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need JavaScript Engine to parse and run JavaScript code inside the page.
There are a bunch of headless browsers that can help you
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The Content of the website may be generated after load via javascript, In order to obtain the generated script via python refer to this answer
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you rather need a Headless browser that would also generate the DOM, load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.
TRY THIS FIRST!
Perhaps the data technically could be in the javascript itself and all this javascript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)

Webscraping Financial Data from Morningstar

I am trying to scrape data from the morningstar website below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying to do just IBM but hope to eventually be able to type in the code of another company and do this same with that one. My code so far is below:
import requests, os, bs4, string
url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US';
fin_tbl = ()
page = requests.get(url)
c = page.content
soup = bs4.BeautifulSoup(c, "html.parser")
summary = soup.find("div", {"class":"r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])
The problem I am experiencing at the moment is unlike a simpler webpage I have scraped the program can't seem to locate any tables even though I can see them in the HTML for the page.
In researching this problem the closest stackoverflow question is below:
Python webscraping - NoneObeject Failure - broken HTML?
In that one they explained that Morningstar's tables are dynamically loaded and used some json code I am unfamiliar with and somehow generated a different weblink which managed to scrape the data but I don't understand where it came from?
It's a real problem scraping some modern web pages, particularly on pages generated by single-page applications (where the content is maintained by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response).
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
Would the search be without result, a usually fruitful approach is to investigate what ajax calls the page is doing to retrieve data and then issue them directly. This can be achieved via the browser debuggers, tab "network" or so where each request can be investigated in detail in a very friendly UI.
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for nodejs/phantomjs: ScraperJS. It is very easy to use: it injects jQuery into the scraped page and you can extract data with jQuery selectors.

How can I see all notes of a Tumblr post from Python?

Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.
I'd like to get all the above notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or tumblr Api).
Some extensive Googling did not produce any items relating to notes' extraction in Tumblr.
Can anyone point me in the right direction on which tool will enable me to do that?
Unfortunately looks like the Tumblr API has some limitations (lacks of meta information about Reblogs, notes limited by 50), so you can't get all the notes.
It is also forbidden to do page scraping according to the Terms of Service.
"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"
Source:
https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc
Without JS you get separate pages that only contain the notes. For the mentioned blog post the first page would be:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Following pages are linked at the bottom, e.g.:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013
…
(See my answer on how to find the next URL in a’s onclick attribute.)
Now you could use various tools to download/parse the data.
The following wget command should download all notes pages for that post:
wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Like Fabio implies, it is better to use the API.
If for whatever reasons you cannot, then the tools you will use will depend on what you want to do with the data in the posts.
for a data dump: urllib will return a string of the page you want
looking for a specific section in the html: lxml is pretty good
looking for something in unruly html: definitely beautifulsoup
looking for a specific item in a section: beautifulsoup, lxml, text parsing is what you need.
need to put the data in a database/file: use scrapy
Tumblr url scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc... until you get to the end of the posts and the servers just does not return any data anymore.
So if you are going to brute force your way to scraping, you can easily tell your script to dump all the data on your hard drive until, say the contents tag, is empty.
One last word of advice, please remember to put a small sleep(1000) in your script, because you could put some stress on Tumblr servers.
how to load all notes on tumblr? also covers the topic, but unor's response (above) does it very well.

How to extract all the url's from a website?

I am writing a programme in Python to extract all the urls from a given website. All the url's from a site not from a page.
As I suppose I am not the first one who wants to do that I was wondering if there was a ready made solution or if I have to write the code myself.
It's not gonna be easy, but a decent starting point would be to look into these two libraries:
urllib
BeautifulSoup
I didn't see any ready made scripts that does this on a quick google search.
Using the scrapy framework makes this almost trivial.
The time consuming part would be learning how to use scrapy. THeir tutorials are great though and shoulndn't take you that long.
http://doc.scrapy.org/en/latest/intro/tutorial.html
Creating a solution that others can use is one of the joys of being part of a programming community. iF a scraper doesn't exist you can create one that everyone can use to get all links from a site!
The given answers are what I would have suggested (+1).
But if you really want to do something quick and simple, and you're on a *NIX platform, try this:
lynx -dump YOUR_URL | grep http
Where YOUR_URL is the URL that you want to check. This should get you all the links you want (except for links that are not fully written)
You first have to download the page's HTML content using a package like urlib or requests.
After that, you can use Beautiful Soup to extract the URLs. In fact, their tutorial shows how to extract all links enclosed in <a> elements as a specific example:
for link in soup.find_all('a'):
print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
If you also want to find links not enclosed in <a> elements, you'll may have to write something more complex on your own.
EDIT: I also just came across two Scrapy link extractor classes that were created specifically for this task:
http://doc.scrapy.org/en/latest/topics/link-extractors.html

Extracting parts of HTML from website using python

I'm currently working on a project that involves a program to inspect a web page's HTML using Python. My program has to monitor a web page, and when a change is made to the HTML, it will complete a set of actions. My question is how do you extract just part of a web page, and how do you monitor a web page's HTML and report almost instantly when a change is made. Thanks.
In the past I wrote my own parsers. Nowadays HTML is HTML 5, more statements,more Javascript, a lot of crappiness done by developers and their editors, like
document.write('<SCR' + 'IPT
And some web frameworks / developers bad coding change the Last-Modified in the HTTP header on every request, even if for a human person the text you read on the page isn't changed.
I suggest you BeautifulSoup for the parsing stuff; by your own you have to careful choose what to watch to decide if the Web page is modified.
Its intro :
BeautifulSoup is a Python package that parses broken HTML, just like
lxml supports it based on the parser of libxml2. BeautifulSoup uses a
different parsing approach. It is not a real HTML parser but uses
regular expressions to dive through tag soup. It is therefore more
forgiving in some cases and less good in others. It is not uncommon
that lxml/libxml2 parses and fixes broken HTML better, but
BeautifulSoup has superiour support for encoding detection. It very
much depends on the input which parser works better.
Scrapy might be a good place to start. http://doc.scrapy.org/en/latest/intro/overview.html
Getting sections of websites is easy, it is just xml, you can use scrapy or beautifulsoup.

Categories