Building an HTML Diff/Patch Algorithm - python

A description of what I'm going to accomplish:
Input 2 (N is not essential) HTML documents.
Standardize the HTML format
Diff the two documents -- external styles are not important but anything inline to the document will be included.
Determine delta at the HTML Block Element level.
Expanding the last point:
Imagine two pages of the same site that both share a sidebar with what was probably a common ancestor that has been copy/pasted. Each page has some minor changes to the sidebar. The diff will reveal these changes, then I can "walk up" the DOM to find the first common block element shared by them, or just default to <body>. In this case, I'd like to walk it up and find that, oh, they share a common <div id="sidebar">.
I'm familiar with DaisyDiff and the application is similar -- in the CMS world.
I've also begun playing with the google diff-patch library.
I wanted to ask this kind of non-specific question to hopefully solicit any advice or guidance that anybody thinks could be helpful. Currently if you put a gun to my head and said "CODE IT" I'd rewrite DaisyDiff in Python and add in this block-level logic. But I thought maybe there's a better way, and the answers to Anyone have a diff algorithm for rendered HTML? made me feel warm and fuzzy.

If you were going to start from scratch, a useful search term would be "tree diff".
There's a pretty awesome blog post here, although I just found it by googling "daisydiff python" so I bet you've already seen it. Besides all the interesting theoretical stuff, he mentions the existence of Logilab's xmldiff, an open-source XML differ written in Python. That might be a decent starting point — maybe less correct than trying to wrap or reimplement DaisyDiff, but probably easier to get up and running quickly.
There's also html-tree-diff on pypi, which I found via this Quora link: http://www.quora.com/Is-there-any-good-Python-implementation-of-a-tree-diff-algorithm
There's some theoretical stuff about tree diffing at efficient diff algorithm for trees and Levenshtein distance on cstheory.stackexchange.
BTW, just to clarify, you are talking about diffing two DOM trees, but not necessarily rendering the diff/merge back into any particular HTML, right? (EDIT: Right.) A lot of the similarly-worded questions on here are really asking "how can I color deleted lines red and added lines green" or "how can I make matching paragraphs line up visually", skipping right over the theoretical hard part of "how do I diff two DOM trees in the first place" and the practical hard part of "how do I parse possibly malformed HTML into a DOM tree even before that". :)
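To make the "walk up the DOM" step from the question concrete, here is a minimal sketch assuming lxml, where some earlier diff step has already flagged a changed element and we climb toward <body> until we hit a block-level container that can also be located in the other document. Matching containers by tag + id is an assumption here, not a complete algorithm:
import lxml.html

BLOCK_TAGS = {'div', 'section', 'article', 'aside', 'table', 'ul', 'ol', 'body'}

def enclosing_shared_block(changed_el, other_root):
    # climb from the changed element toward the root
    node = changed_el
    while node is not None:
        if node.tag in BLOCK_TAGS:
            node_id = node.get('id')
            # does the other document contain a block with the same tag and id?
            if node_id and other_root.xpath('//%s[@id=$i]' % node.tag, i=node_id):
                return node
        node = node.getparent()
    return None  # caller can default to <body>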

I know this question is related to Python, but you could take a look at 3DM - XML 3-way Merging and Differencing Tool (default implementation in Java). Here is the actual paper describing the algorithm used: http://www.cs.hut.fi/~ctl/3dm/thesis.pdf, and here is the link to the site.
The drawback is that you do have to clean up the document and be able to parse it as XML.

You could start by using beautifulsoup to parse both documents.
Then you have a choice:
use prettify to render both documents as more or less standardized HTML and diff those.
compare the parse trees.
The latter allows you to e.g. discard elements that only affect the presentation, not the content. The former is probably easier.
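For the first option, a minimal sketch (assuming BeautifulSoup 4 and the standard library's difflib; the sample markup is made up):
import difflib
from bs4 import BeautifulSoup

def html_diff(html_a, html_b):
    # prettify() puts each tag on its own line, giving difflib sane units to compare
    lines_a = BeautifulSoup(html_a, 'html.parser').prettify().splitlines()
    lines_b = BeautifulSoup(html_b, 'html.parser').prettify().splitlines()
    return list(difflib.unified_diff(lines_a, lines_b, lineterm=''))

for line in html_diff('<div id="sidebar"><p>old link</p></div>',
                      '<div id="sidebar"><p>new link</p></div>'):
    print(line)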

Related

Identifying large bodies of text via BeautifulSoup or other python based extractors

Given some random news article, I want to write a web crawler to find the largest body of text present, and extract it. The intention is to extract the physical news article on the page.
The original plan was to use a BeautifulSoup findAll(True) command (which means extract all html tags) and to sort each tag by its .getText() value. EDIT: don't use BeautifulSoup for html work, use the lxml library; it's Python-based and much faster than BeautifulSoup.
But this won't work for most pages, like the one I listed as an example, because the large body of text is split into many smaller tags, like paragraph dividers for example.
Does anyone have any experience with this? Any help with something like this would be amazing.
At the moment I'm using BeautifulSoup along with python, but willing to explore other possibilities.
EDIT: Came back to this question a few months later (wow, I sounded like an idiot ^), and solved this with a combination of libraries & my own code.
Here are some extremely helpful Python libraries for the task, in order of how much each helped me:
#1 goose library Fast, powerful, consistent
#2 readability library Content is passable, slower on average than goose but faster than boilerpipe
#3 python-boilerpipe Slower & harder to install; no fault of the boilerpipe library itself (originally in Java), but of the fact that this library is built on top of another library in Java, which adds IO time, errors, etc.
I'll release benchmarks perhaps if there is interest.
Indirectly related libraries, you should probably install them and read their docs:
NLTK text processing library This is too good not to install. They provide text analysis tools along with html tools (like cleanup, etc).
lxml html/xml parser Mentioned above. This beats BeautifulSoup in every aspect but usability. It's a bit harder to learn but the results are worth it. HTML parsing takes much less time; it's very noticeable.
python webscraper library I think the value of this code isn't the lib itself, but using the lib as a reference manual to build your own crawlers/extractors. It's very nicely coded / documented!
A lot of the value and power in using Python, a rather slow language, comes from its open-source libraries. They are especially awesome when combined and used together, and everyone should take advantage of them to solve whatever problems they may have!
Goose library gets lots of solid maintenance, they just added Arabic support, it's great!
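Since lxml keeps coming up, here is a crude baseline sketch for the original "largest body of text" problem: score candidate containers by the length of the paragraph text sitting directly inside them. The tag list and the direct-children heuristic are assumptions; the dedicated extractors above are far smarter.
import lxml.html

def largest_text_block(html):
    root = lxml.html.fromstring(html)
    best, best_len = None, 0
    for el in root.iter('div', 'article', 'section', 'td'):
        # only count <p> children sitting directly inside this container
        text = ' '.join(p.text_content() for p in el.findall('p'))
        if len(text) > best_len:
            best, best_len = el, len(text)
    return best

# usage: block = largest_text_block(raw_html); print(block.text_content())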
You might look at the python-readability package which does exactly this for you.
You're really not going about it the right way, I would say, as all the comments above would attest to.
That said, this does what you're looking for.
from bs4 import BeautifulSoup as BS
import requests
html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
soup = BS(html)
print '\n\n'.join([k.text for k in soup.find(class_='cnn_strycntntlft').find_all('p')])
It pulls out only the text, first by finding the main container of all the <p> tags, then by selecting only the <p> tags themselves to get the text; ignoring the <script> and other irrelevant ones.
As was mentioned in the comments, this will only work for CNN--and possibly, only this page. You might need a different strategy for every new webpage.

Suggest semantic tags for short snippets of text

I am interested in generating a list of suggested semantic tags (via links to Freebase, Wikipedia or another system) to a user who is posting a short text snippet. I'm not looking to "understand" what the text is really saying, or even to automatically tag it, I just want to suggest to the user the most likely semantic tags for his/her post. My main goal is to force users to tag semantically and therefore consistently and not to write in ambiguous text strings. If there were a reasonably functional and reasonably priced tool on the market, I would use it. I have not found such a tool so I am looking in to writing my own.
My question is first of all, if there is such a tool that I have not encountered. I've looked at Zemanta, AlchemyAPI and OpenCalais and none of them seemed to offer the service I need.
Assuming that I'm writing my own, I'd be doing it in Python (unless there was a really compelling reason to use something else). My first guess would be to search for n-grams that match "entities" in Freebase and suggest them as tags, perhaps searching in descriptions of entities as well to get a little "smarter." If that proved insufficient, I'd read up and dip my toes into the ontological water. Since this is a very hard problem and I don't think that my application requires its solution, I would like to refrain from real semantic analysis as much as possible.
Does anyone have experience working with a semantic database system and could give me some pointers regarding where to begin and what sort of pitfalls to expect?
Take a look at the NLTK Python library. It contains a vast number of tools, dictionaries and algorithms.
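For the n-gram idea from the question, a minimal sketch of pulling candidate tag phrases out of a snippet with NLTK; the Freebase/Wikipedia lookup is left out, and the "phrases that end in a noun" heuristic is just an assumption:
import nltk  # needs the 'punkt' tokenizer and a POS tagger model from nltk.download()

def candidate_tags(snippet, max_ngram=2):
    # POS-tag the snippet once, then keep n-grams that end in a noun
    tagged = nltk.pos_tag(nltk.word_tokenize(snippet))
    candidates = set()
    for n in range(1, max_ngram + 1):
        for gram in nltk.ngrams(tagged, n):
            words, tags = zip(*gram)
            if tags[-1].startswith('NN'):
                candidates.add(' '.join(words))
    return candidates

# each candidate would then be looked up against Freebase/Wikipedia,
# and only the ones that resolve to an entity suggested to the user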

What are some of the Artificial Intelligence (AI) related techniques one would use for parsing a webpage?

I would like to scrape several different discussions forums, most of which have different HTML formats. Rather than dissecting the HTML for each page, it would be more efficient (and fun) to implement some sort of Learning Algorithm that could identify the different messages (i.e. structures) on each page, and individually parse them while simultaneously ignoring all the extraneous crap (i.e., ads and other nonsense). Could someone please point me to some references or sample code for work that's already been carried out in this area.
Moreover, does anyone know of pseudocode for Arc90's readability code?
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
build a solution that:
takes some sample webpages with the same structure (eg forum threads)
analyzes the DOM tree of each to find which parts are the same / different
where they are different is the dynamic content you are after (posts, user names, etc)
This technique is known as wrapper induction.
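A toy sketch of that idea, assuming lxml and two sample pages generated from the same template: index every element by its structural path and report the paths whose text differs, since those are the slots holding the dynamic content. Real wrapper-induction systems are considerably more robust than this.
import lxml.html

def text_by_path(html):
    tree = lxml.html.fromstring(html).getroottree()
    # map each element's structural path (e.g. /html/body/div[2]/p[1]) to its text
    return {tree.getpath(el): (el.text or '').strip()
            for el in tree.getroot().iter()}

def dynamic_slots(html_a, html_b):
    a, b = text_by_path(html_a), text_by_path(html_b)
    # paths present in both pages but with different text = dynamic content
    return {path for path in set(a) & set(b) if a[path] != b[path]}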
There seems to be a Python port of arc90's Readability script that might point you in the right direction (or at least some direction).
Maybe not exactly on point, but there's an O'Reilly book called 'Programming Collective Intelligence' that may lead you in the right direction for what you are attempting to do. Additionally, many of the examples are in Python :)

Help (or advice) me get started with lxml

I am trying to learn Python, and I actually feel that "Learn Python the Hard Way", "A Byte of Python", and "Head First Python" are really great books. However, now that I want to start a "real" project, lxml makes me feel like a complete git.
This is what I would like to do (objectives)
I am trying to parse a newspaper site's articles about politics
The url is http://politiken.dk/politik/
The final project should:
1) Each day (maybe each hour) visit the above URL.
2) For each relevant article, I want to save the url to a database. The relevant articles are in a <div class="w460 section_forside sec-forside">. Some of the elements have images, some don't.
I would like to save the following:
a - the headline (<h1 class="top-art-header fs-26">)
b - the subheader (<p class="subheader-art">)
c - if the element has a corresponding img, then the "alt" or "title" attribute
3) Visit each relevant URL, scrape the article's body and save it to the database.
4) If a relevant URL is already in the database, then I skip that URL (the relevant articles as defined above are always the latest 10 published).
The desired result should be a database table with fields:
art.i) ID
art.ii) URL
art.iii) headline
art.iiii) subheader
art.iiiii) img alt
art.iiiiii) article body.
art.iiiiiii) date and time (a string located in <span class="date tr-upper m-top-2">)
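For what it is worth, that table could be sketched in sqlite3 from the standard library like this (the column names are only my guesses at the fields above):
import sqlite3

conn = sqlite3.connect('articles.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id        INTEGER PRIMARY KEY,   -- art.i
        url       TEXT UNIQUE,           -- art.ii, UNIQUE lets me skip known URLs
        headline  TEXT,                  -- art.iii
        subheader TEXT,                  -- art.iiii
        img_alt   TEXT,                  -- art.iiiii
        body      TEXT,                  -- art.iiiiii
        published TEXT                   -- art.iiiiiii, the raw date/time string
    )
""")
conn.commit()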
The above is what I would like help to accomplish. Since screen-scraping is not always benevolent, I would like to explain why I want to do this.
Basically I want to mine the data for occurrences of members of parliament or political parties. I will not republish the articles, sell the data or some such thing (I have not checked the legality of my approach, but hope and think it should be legal).
I imagine I have a table of politicians and a table of political parties.
for each politician I will have:
pol.i) ID
pol.ii) first_name
pol.iii) sur_name
pol.iiii) party
For each political party I will have:
party.i) ID
party.ii) correct-name
party.iii) calling-name
party.iiii) abbreviation
I want to do this for several Danish newspaper sites, and then analyse whether one newspaper gives preference to some politicians / parties - simply based on the number of mentions.
This I will also need help to do - but one step at a time :-)
Later I would like to explore NLTK and the possibilities for sentiment mining.
I want to see if this could turn into a ph.d. project in political science/journalism.
This is basically what I have (i.e. nothing)
I really have a hard time wrapping my head around lxml, the concept of elements, the different parsers, etc. I have of course read the tutorials but I am still very much stuck.
import lxml.html
url = "http://politiken.dk/politik/"
root = lxml.html.parse(url).getroot()
# this should return all the relevant elements
# does not work:
#relevant = root.cssselect("div.w460 section_forside sec-forside") # the class has spaces in the name - but I can't seem to escape them?
# this will return all the linked articles' headlines
artikler = root.cssselect("h1.top-art-header")
# narrowing down, we use the same call to get just the URLs of the articles that we have already retrieved
# these urls we will later mine, and subsequently skip
retrived_urls = []
for a in root.cssselect("h1.top-art-header a"):
    retrived_urls.append(a)
# this works.
What I hope to get from the answers
First off - as long as you don't call me (very bad) names - I will continue to be happy.
But what I really hope is a simple to understand explanation of how lxml works. If I know what tools to use for the above tasks it would be so much easier for me to really "dive into lxml". Maybe because of my short attention span, I currently get disillusioned when reading stuff way above my level of understanding, when I am not even sure that I am looking in the right place.
If you could provide any example code that fits some of the tasks, that would be really great. I hope to turn this project into a ph.d. but I am sure this sort of thing must have been done a thousand times already? If so, it is my experience that learning from others is a great way to get smarter.
If you feel strongly that I should forget about lxml and use e.g. scrapy or html5lib, then please say so :-) I started to look into html5lib because Drew Conway suggests it in a blog post about Python tools for the political scientist, but I couldn't find any introduction-level material. Also, lxml is what the good people at ScraperWiki recommend. As for scrapy, it might be the best solution, but I am afraid that scrapy is too much of a framework - as such, really good if you know what you are doing and want to do it fast, but maybe not the best way to learn Python magic.
I plan on using a relational database, but if you think e.g. mongo would be an advantage, I will change my plans.
Since I can't install lxml in Python 3.1, I am using 2.6. If this is wrong - please say so also.
Timeframe
I have asked a bunch of beginner questions on stackoverflow. Too many to be proud of. But with more than a full-time job I never seem to be able to bury myself in code and just absorb the skillz I so long for. I hope this will be a question/answer that I can come back to regularly and update with what I have learned, and relearn what I have forgotten. This also means that this question will most likely remain active for quite some time. But I will comment on every answer that I might be lucky enough to receive, and I will continuously update the "what I got" section.
Currently I feel that I might have bitten off more than I can chew - so now it's back to "Head First Python" and "Learn Python the Hard Way".
Final words
If you have gotten this far - you are amazing - even if you don't answer the question. You have now read a lot of simple, confused, and stupid questions (I am proud of asking those questions, so don't argue). You should grab a coffee and a filterless smoke and congratulate yourself :-)
Happy holidays (in Denmark we celebrate Easter and currently the sun is shining like Samuel L. Jackson's wallet in Pulp Fiction)
Edits
It seems BeautifulSoup is a good choice. According to the developer, however, BeautifulSoup is not a good choice if I want to use Python 3. But as per this I would prefer Python 3 (not strongly though).
I have also discovered that there is an lxml chapter in "Dive Into Python 3". Will look into that as well.
This is a lot to read - perhaps you could break up into smaller specific questions.
Regarding lxml, here are some examples. The official documentation is also very good - take the time to work through the examples. And the mailing list is very active.
Regarding BeautifulSoup, lxml is more efficient and in my experience can handle broken HTML better than BeautifulSoup. The downside is lxml relies on C libraries so can be harder to install.
lxml is definitely the tool of choice these days for html parsing.
There is an lxml cheat sheet with many of your answers here:
http://scraperwiki.com/docs/contrib/python_lxml_cheat_sheet/
That batch of code you wrote works as-is and it runs in a ScraperWiki edit window.
http://scraperwiki.com/scrapers/andreas_stackoverflow_example/edit/
Normally a link is of the form:
<a href="link">title</a>
After parsing by lxml, you can get at the link using:
a.attrib.get("href")
and the text using:
a.text
However, in this particular case the links are of the form:
<a href="link"> <span> </span> title</a>
so the value a.text represents only the characters between '<a href="link">' and that first '<span>'.
But you can use the following code to flatten it down by recursing through the sub-elements (the <span> in this case):
def flatten(el):
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)

Automated Class timetable optimize crawler?

Overall Plan
Get my class information to automatically optimize and select my uni class timetable
Overall Algorithm
1. Logon to the website using its Enterprise Sign On Engine login
2. Find my current semester and its related subjects (pre setup)
3. Navigate to the right page and get the data from each related subject (lecture, practical and workshop times)
4. Strip the data of useless information
5. Rank the classes which are closer to each other higher, the ones on random days lower
6. Solve a best time table solution
7. Output me a detailed list of the BEST CASE information
8. Output me a detailed list of the possible class information (some might be full for example)
9. Get the program to select the best classes automatically
10. Keep checking to see if we can achieve 7.
6 in detail
Get all the classes, using the lectures as a focus point (they would be highest ranked, only one per subject), and try to arrange the other classes around them.
Questions
Can anyone supply me with links to something that might be similar to this, hopefully written in Python?
In regards to 6: what data structure would you recommend to store this information in? A linked list where each object is a uniclass?
Should I write all the information to a text file?
I am thinking uniclass would be set up with the following attributes:
Subject
Rank
Time
Type
Teacher
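A rough sketch of what that might look like as a plain Python class (the (day, start, end) shape for times is an assumption):
class UniClass(object):
    def __init__(self, subject, rank, times, kind, teacher):
        self.subject = subject   # e.g. the subject code
        self.rank = rank         # score from the ranking step
        self.times = times       # list of (day, start_hour, end_hour) tuples
        self.kind = kind         # "lecture", "practical" or "workshop"
        self.teacher = teacher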
I am hardly experienced in Python and thought this would be a good learning project to try to accomplish.
Thanks for any help and links provided to help get me started; open to edits to tag appropriately or whatever is necessary (not sure what this falls under other than programming and python?)
EDIT: can't really get the proper formatting I want for this SO post ><
Depending on how far you plan on taking #6, and how big the dataset is, it may be non-trivial; it certainly smacks of NP-hard global optimisation to me...
Still, if you're talking about tens (rather than hundreds) of nodes, a fairly dumb algorithm should give good enough performance.
So, you have two constraints:
A total ordering on the classes by score; this is flexible.
Class clashes; this is not flexible.
What I mean by flexible is that you can go to more spaced out classes (with lower scores), but you cannot be in two classes at once. Interestingly, there's likely to be a positive correlation between score and clashes; higher scoring classes are more likely to clash.
My first pass at an algorithm:
selected_classes = []
# greedy pass: take classes in score order, skipping any that clash with ones already chosen
classes = sorted(classes, key=lambda c: c.score)
for clas in classes:
    if not clas.clashes_with(selected_classes):
        selected_classes.append(clas)
Working out clashes might be awkward if classes are of uneven lengths, start at strange times and so on. Mapping start and end times into a simplified representation of "blocks" of time (every 15 minutes / 30 minutes or whatever you need) would make it easier to look for overlaps between the start and end of different classes.
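A sketch of that block mapping, assuming each class stores its meeting times as (day, start_hour, end_hour) tuples and half-hour blocks are fine-grained enough:
def blocks(clas, block_hours=0.5):
    # map each meeting onto a set of (day, block index) slots
    occupied = set()
    for day, start, end in clas.times:
        t = start
        while t < end:
            occupied.add((day, int(round(t / block_hours))))
            t += block_hours
    return occupied

def clashes_with(clas, selected_classes):
    # a clash is any shared (day, block) slot with an already selected class
    mine = blocks(clas)
    return any(mine & blocks(other) for other in selected_classes)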
BeautifulSoup was mentioned here a few times, e.g. get-list-of-xml-attribute-values-in-python.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.
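For instance, the queries that blurb describes look roughly like this against bs4 (the markup is made up for illustration):
import re
from bs4 import BeautifulSoup

html = """<table><tr><th><b>Score</b></th></tr></table>
          <a class="externalLink" href="http://foo.com/x">foo</a>
          <a href="/local">local</a>"""
soup = BeautifulSoup(html, 'html.parser')

all_links = soup.find_all('a')                                # find all the links
external = soup.find_all('a', class_='externalLink')          # links of class externalLink
foo_links = soup.find_all('a', href=re.compile(r'foo\.com'))  # links whose urls match foo.com
bold_heading = soup.find('th').find('b')                      # table heading with bold text
print(bold_heading.get_text())                                # -> Score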
There are waaay too many questions here.
Please break this down into subject areas and ask a specific question on each subject. Please define your terms: "best" doesn't mean anything without some specific measurement to optimize.
Here's what I think I see in your list of topics.
Scraping HTML
1 Logon to the website using its Enterprise Sign On Engine login
2 Find my current semester and its related subjects (pre setup)
3 Navigate to the right page and get the data from each related subject (lecture, practical and workshop times)
4 Strip the data of useless information
Some algorithm to "rank" based on "closer to each other" looking for a "best time". Since these terms are undefined, it's nearly impossible to provide any help on this.
5 Rank the classes which are closer to each other higher, the ones on random days lower
6 Solve a best time table solution
Output something.
7 Output me a detailed list of the BEST CASE information
8 Output me a detailed list of the possible class information (some might be full for example)
Optimize something, looking for "best". Another undefinable term.
9 Get the program to select the best classes automatically
10 Keep checking to see if we can achieve 7.
BTW, Python has "lists". Whether or not they're "linked" doesn't really enter into it.
