I am writing a user app that takes the currently open Wikipedia page as input. I have written a module that takes this page and generates a list of keywords related to the article using web scraping and natural language processing.
I want to expand the app so that, in addition to the keywords I have identified, it offers a set of related topics that may interest the user. Does Wikipedia provide an API that will do the trick? If not, can anybody point me to what I should be looking into (in case I have to write code from scratch)? I would also appreciate pointers to an algorithm that trains a machine to identify topic maps. I am not looking for a paper but for a practical implementation of something basic.
So, to summarize:
I need a way to find topics related to the current Wikipedia article (categories will also do).
I would also appreciate a sample algorithm for training a machine to identify topics that are usually related and clustered.
P.S. Please be specific, because I have already researched a number of obvious possibilities.
I appreciate it, thank you.
You can scrape the categories if you want. If you're working with python, you can read the wikitext directly from their API, and use mwlib to parse the article and find the links.
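As a rough sketch of the "find the links" step: the snippet below pulls link targets and categories out of a wikitext string with a regular expression. It uses a regex in place of mwlib so that it runs with the standard library only, the sample wikitext is made up, and fetching the text from the API is left out. The name split_links is just for illustration.

```python
import re

# Matches [[Target]], [[Target#Section]] and [[Target|label]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def split_links(wikitext):
    """Return (article_links, categories) found in a wikitext string."""
    links, categories = [], []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if target.lower().startswith("category:"):
            categories.append(target[len("category:"):].strip())
        elif ":" not in target:
            # Simplification: skips File:, Template: and other namespaces,
            # but also any article title that contains a colon.
            links.append(target)
    return links, categories


# Made-up sample text, not taken from a real article.
sample = (
    "An '''article''' is a written work published in a [[newspaper]] "
    "or [[Magazine|magazine]].\n"
    "[[Category:Publishing]]\n"
    "[[Category:Written communication]]\n"
)
links, categories = split_links(sample)
print(links)
print(categories)
```

A real parser handles nested templates and other edge cases that this regex ignores, so treat it as a starting point only.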
A more interesting but harder to implement approach would be to create clusters of related terms, and given the list of terms extracted from an article, find the closest terms to them.
"See also" is a section often present in Wikipedia pages.
It is structured like the example below, from [[Article (publishing)]]:
==See also==
* [[Article directory]]
* [[Electronic article]]
You can then parse the wikicode (available via the dumps or the MediaWiki API, as hinted in the previous answers) and use the articles listed there.
Another way is to use the Wikipedia categories directly; there are APIs for that.
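Parsing the "See also" section described above could look roughly like this. It is a sketch that assumes you already have the wikitext as a string and that the section uses a level-2 heading as in the example; see_also_titles is a made-up name.

```python
import re


def see_also_titles(wikitext):
    """Return the article titles linked from the "See also" section."""
    # Capture everything after the heading, up to the next level-2 heading
    # or the end of the text.
    section = re.search(
        r"^==[ \t]*See also[ \t]*==[ \t]*\n(.*?)(?=^==[^=]|\Z)",
        wikitext,
        flags=re.MULTILINE | re.DOTALL | re.IGNORECASE,
    )
    if section is None:
        return []
    # Keep the link target and drop any "|label" part.
    return [t.strip() for t in re.findall(r"\[\[([^\]|]+)", section.group(1))]


# Made-up wikitext following the structure shown above.
sample = (
    "Some article text.\n"
    "==See also==\n"
    "* [[Article directory]]\n"
    "* [[Electronic article]]\n"
    "==References==\n"
    "* [[Not a related topic]]\n"
)
print(see_also_titles(sample))
```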
I'm working on a project where I need to extract "inputs" and "query intent" from text.
For example "What is the status of asset X26TH?"
In this case the main issue is to extract asset id which is X26TH, but how can I make my code understand that it's an id?
The other thing is to understand the query intent, which is asset status. I found a good library for this called quepy, but it's meant for Linux and I couldn't set it up on Windows.
Please help me with the techniques and libraries.
So you have two problems, ID extraction and intent detection.
ID Extraction
If your IDs follow a regular pattern and definitely don't look like English, you can catch them with a regex - if that's possible, that's great since it's very easy to do. If you have a fixed list of product IDs, just check to see if any of them are in the input. If neither of those works, you'll have to get more sophisticated.
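Both of the easy options fit in a few lines. This is only a sketch: the pattern (one letter, two digits, two letters) is guessed from the single example X26TH, and KNOWN_IDS is a hypothetical inventory, so adjust both to your real IDs.

```python
import re

# Assumed ID shape, guessed from "X26TH": one letter, two digits, two letters.
ASSET_ID = re.compile(r"\b[A-Z]\d{2}[A-Z]{2}\b")

# Hypothetical fixed list of valid IDs.
KNOWN_IDS = {"X26TH", "B07QK"}


def find_ids_by_pattern(text):
    """Return every substring that matches the assumed ID pattern."""
    return ASSET_ID.findall(text)


def find_ids_by_lookup(text):
    """Return every token that appears in the known-ID list, upper-cased."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.upper() for t in tokens if t.upper() in KNOWN_IDS]


question = "What is the status of asset X26TH?"
print(find_ids_by_pattern(question))
print(find_ids_by_lookup(question))
```

The lookup version also catches IDs typed in lower case, which the pattern version as written does not.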
Can you get your users to remember a little syntax? If you can request that they write things with a prefix like id:X26TH or similar that would make your job easier. You may find the way the plumber in Plan9 works informative.
If you need to work with whatever the users throw at you, you should look into using a sequence labeller or Named Entity Recognition (NER) system to get IDs. CRFs are probably a good fit for this task; here's a good technical introduction, and the New York Times also used one with success. Besides being trickier to set up, a downside of this is that it requires training data, but there's really no way to avoid that.
Intent Detection
This is usually modelled as a text classification problem. You can find an overview of how to do that here. Here are some training examples from the article:
training_data.append({"class":"greeting", "sentence":"how are you?"})
training_data.append({"class":"greeting", "sentence":"how is your day?"})
training_data.append({"class":"greeting", "sentence":"good day"})
training_data.append({"class":"greeting", "sentence":"how is it going today?"})
training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"see you later"})
training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"talk to you soon"})
training_data.append({"class":"sandwich", "sentence":"make me a sandwich"})
training_data.append({"class":"sandwich", "sentence":"can you make a sandwich?"})
training_data.append({"class":"sandwich", "sentence":"having a sandwich today?"})
training_data.append({"class":"sandwich", "sentence":"what's for lunch?"})
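To show what a classifier can do with data in that shape, here is a very small bag-of-words scorer using a subset of the examples above. It is not necessarily the method the linked overview uses: it just counts how often each word of the query appeared in each class's training sentences, normalised by class size. The function names are my own.

```python
import re
from collections import Counter, defaultdict

# A subset of the training examples, in the same shape.
training_data = [
    {"class": "greeting", "sentence": "how are you?"},
    {"class": "greeting", "sentence": "good day"},
    {"class": "goodbye", "sentence": "see you later"},
    {"class": "goodbye", "sentence": "talk to you soon"},
    {"class": "sandwich", "sentence": "make me a sandwich"},
    {"class": "sandwich", "sentence": "what's for lunch?"},
]


def tokenize(sentence):
    """Lower-case the sentence and split it into words."""
    return re.findall(r"[a-z']+", sentence.lower())


def train(examples):
    """Count word occurrences per class."""
    counts = defaultdict(Counter)
    for example in examples:
        counts[example["class"]].update(tokenize(example["sentence"]))
    return counts


def classify(counts, sentence):
    """Return the best-scoring class, or None if no word was seen in training."""
    words = tokenize(sentence)
    scores = {
        label: sum(word_counts[w] for w in words) / sum(word_counts.values())
        for label, word_counts in counts.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


model = train(training_data)
print(classify(model, "can you make a sandwich"))
```

For your case the classes would be intents such as "asset status", and you would need your own example sentences for each.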
A description of what I'm going to accomplish:
Input 2 (N is not essential) HTML documents.
Standardize the HTML format
Diff the two documents -- external styles are not important but anything inline to the document will be included.
Determine delta at the HTML Block Element level.
Expanding the last point:
Imagine two pages of the same site that both share a sidebar with what was probably a common ancestor that has been copy/pasted. Each page has some minor changes to the sidebar. The diff will reveal these changes, then I can "walk up" the DOM to find the first common block element shared by them, or just default to <body>. In this case, I'd like to walk it up and find that, oh, they share a common <div id="sidebar">.
I'm familiar with DaisyDiff and the application is similar -- in the CMS world.
I've also begun playing with the google diff-patch library.
I wanted to ask this kind of non-specific question to hopefully solicit any advice or guidance that anybody thinks could be helpful. Currently, if you put a gun to my head and said "CODE IT", I'd rewrite DaisyDiff in Python and add in this block-level logic. But I thought maybe there's a better way, and the answers to "Anyone have a diff algorithm for rendered HTML?" made me feel warm and fuzzy.
If you were going to start from scratch, a useful search term would be "tree diff".
There's a pretty awesome blog post here, although I just found it by googling "daisydiff python" so I bet you've already seen it. Besides all the interesting theoretical stuff, he mentions the existence of Logilab's xmldiff, an open-source XML differ written in Python. That might be a decent starting point — maybe less correct than trying to wrap or reimplement DaisyDiff, but probably easier to get up and running quickly.
There's also html-tree-diff on pypi, which I found via this Quora link: http://www.quora.com/Is-there-any-good-Python-implementation-of-a-tree-diff-algorithm
There's some theoretical stuff about tree diffing at efficient diff algorithm for trees and Levenshtein distance on cstheory.stackexchange.
BTW, just to clarify, you are talking about diffing two DOM trees, but not necessarily rendering the diff/merge back into any particular HTML, right? (EDIT: Right.) A lot of the similarly-worded questions on here are really asking "how can I color deleted lines red and added lines green" or "how can I make matching paragraphs line up visually", skipping right over the theoretical hard part of "how do I diff two DOM trees in the first place" and the practical hard part of "how do I parse possibly malformed HTML into a DOM tree even before that". :)
I know this question is about Python, but you could take a look at 3DM, an XML 3-way merging and differencing tool (the default implementation is in Java). The paper describing the algorithm is at http://www.cs.hut.fi/~ctl/3dm/thesis.pdf, and here is the link to the site.
The drawback is that you have to clean up the document and be able to parse it as XML.
You could start by using beautifulsoup to parse both documents.
Then you have a choice:
use prettify to render both documents as more or less standardized HTML and diff those.
compare the parse trees.
The latter allows you to e.g. discard elements that only affect the presentation, not the content. The former is probably easier.
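A rough sketch of the first option (standardize, then diff the text). To keep it free of third-party dependencies it uses the standard library's html.parser in place of BeautifulSoup's prettify, so it is a stand-in for the idea and not BeautifulSoup's actual output. It assumes reasonably well-formed markup, and the VOID_TAGS list is deliberately incomplete.

```python
import difflib
from html.parser import HTMLParser

# Tags that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Normalizer(HTMLParser):
    """Render a document as one indented line per tag or text node."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name so attribute order never shows up as a change.
        ordered = sorted(attrs, key=lambda pair: pair[0])
        attr_text = "".join(f' {k}="{v}"' for k, v in ordered)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)
            self.lines.append("  " * self.depth + f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.lines.append("  " * self.depth + data.strip())


def normalize(html):
    """Return the document as a list of standardized lines."""
    parser = Normalizer()
    parser.feed(html)
    parser.close()
    return parser.lines


def html_diff(old, new):
    """Line diff of the two standardized documents, without context lines."""
    return list(difflib.unified_diff(normalize(old), normalize(new), lineterm="", n=0))


old_page = '<div id="sidebar"><p>Home</p><p>About</p></div>'
new_page = '<div id="sidebar"><p>Home</p><p>Contact</p></div>'
print("\n".join(html_diff(old_page, new_page)))
```

Because each output line carries its indentation, you can read the nesting depth off a changed line, which helps with the "walk up to the enclosing block element" step.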
I am interested in generating a list of suggested semantic tags (via links to Freebase, Wikipedia or another system) to a user who is posting a short text snippet. I'm not looking to "understand" what the text is really saying, or even to automatically tag it, I just want to suggest to the user the most likely semantic tags for his/her post. My main goal is to force users to tag semantically and therefore consistently and not to write in ambiguous text strings. If there were a reasonably functional and reasonably priced tool on the market, I would use it. I have not found such a tool so I am looking into writing my own.
My question is first of all, if there is such a tool that I have not encountered. I've looked at Zemanta, AlchemyAPI and OpenCalais and none of them seemed to offer the service I need.
Assuming that I'm writing my own, I'd be doing it in Python (unless there was a really compelling reason to use something else). My first guess would be to search for n-grams that match "entities" in Freebase and suggest them as tags, perhaps searching in descriptions of entities as well to get a little "smarter." If that proved insufficient, I'd read up and dip my toes into the ontological water. Since this is a very hard problem and I don't think that my application requires its solution, I would like to refrain from real semantic analysis as much as possible.
Does anyone have experience working with a semantic database system and could give me some pointers regarding where to begin and what sort of pitfalls to expect?
Take a look at the NLTK Python library. It contains a vast number of tools, dictionaries and algorithms.
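For the n-gram matching idea from the question, you may not even need NLTK to get started. The sketch below uses only the standard library. ENTITIES is a hypothetical stand-in for names you would load from Freebase or Wikipedia titles, and suggest_tags is a made-up name.

```python
import re

# Hypothetical entity names; in practice these would come from Freebase
# or Wikipedia titles.
ENTITIES = {"new york", "new york times", "python", "natural language processing"}
MAX_WORDS = max(len(name.split()) for name in ENTITIES)


def suggest_tags(text):
    """Return entity names that occur as word n-grams in the text, longest first."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = []
    for size in range(MAX_WORDS, 0, -1):  # try the longest n-grams first
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in ENTITIES and candidate not in found:
                found.append(candidate)
    return found


print(suggest_tags("I read the New York Times article on Python."))
```

Note that it suggests both "new york times" and "new york"; whether to drop the shorter overlapping match is a design decision for your application.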
I would like to scrape several different discussions forums, most of which have different HTML formats. Rather than dissecting the HTML for each page, it would be more efficient (and fun) to implement some sort of Learning Algorithm that could identify the different messages (i.e. structures) on each page, and individually parse them while simultaneously ignoring all the extraneous crap (i.e., ads and other nonsense). Could someone please point me to some references or sample code for work that's already been carried out in this area.
Moreover, does anyone know of pseudocode for Arc90's readability code?
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
Build a solution that:
takes some sample webpages with the same structure (e.g. forum threads)
analyzes the DOM tree of each to find which parts are the same and which differ
treats the parts that differ as the dynamic content you are after (posts, user names, etc.)
This technique is known as wrapper induction.
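A toy version of the steps above, to make the idea concrete. It compares only two pages, identifies nodes by their tag-and-class path, and assumes every tag is properly closed, so real forum HTML would need a more forgiving parser. All names are made up for the example.

```python
from html.parser import HTMLParser


class TextByPath(HTMLParser):
    """Record each text node together with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Assumes well-formed markup: every start tag has a matching end tag.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.items.append(("/".join(self.stack), data.strip()))


def text_by_path(html):
    """Return (path, text) pairs for every non-empty text node."""
    parser = TextByPath()
    parser.feed(html)
    parser.close()
    return parser.items


def dynamic_paths(page_a, page_b):
    """Paths whose text differs between two pages built from the same template."""
    static = set(text_by_path(page_a)) & set(text_by_path(page_b))
    paths = []
    for path, text in text_by_path(page_a):
        if (path, text) not in static and path not in paths:
            paths.append(path)
    return paths


page_a = ('<div class="nav">Forum home</div><div class="post">'
          '<span class="user">alice</span><p>First message</p></div>')
page_b = ('<div class="nav">Forum home</div><div class="post">'
          '<span class="user">bob</span><p>Another message</p></div>')
print(dynamic_paths(page_a, page_b))
```

The navigation text is identical on both pages and is dropped; the user name and message paths are what you would then extract from every page of that forum.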
There seems to be a Python port of arc90's Readability script that might point you in the right direction (or at least some direction).
Maybe not exactly what you're after, but there's an O'Reilly book called 'Programming Collective Intelligence' that may lead you in the right direction for what you are attempting to do. Additionally, many of the examples are in Python :)
Overall Plan
Get my class information to automatically optimize and select my uni class timetable
Overall Algorithm
1 Logon to the website using its Enterprise Sign On Engine login
2 Find my current semester and its related subjects (pre setup)
3 Navigate to the right page and get the data from each related subject (lecture, practical and workshop times)
4 Strip the data of useless information
5 Rank the classes which are closer to each other higher, the ones on random days lower
6 Solve a best time table solution
7 Output me a detailed list of the BEST CASE information
8 Output me a detailed list of the possible class information (some might be full for example)
9 Get the program to select the best classes automatically
10 Keep checking to see if we can achieve 7.
6 in detail
Get all the classes and use the lectures as the focal point: lectures would be ranked highest (only one per subject), and the other classes would be arranged around them.
Questions
Can anyone supply me with links to something that might be similar to this hopefully written in python?
In regards to 6: what data structure would you recommend for storing this information? A linked list where each element is a uniclass object?
Should I write all the information to a text file?
I am thinking of setting up uniclass like the following:
attributes:
Subject
Rank
Time
Type
Teacher
I am hardly experienced in Python and thought this would be a good learning project to try to accomplish.
Thanks for any help and links provided to help get me started, open to edits to tag appropriately or what ever is necessary (not sure what this falls under other than programming and python?)
EDIT: I can't really get the formatting I want for this SO post ><
Depending on how far you plan on taking #6, and how big the dataset is, it may be non-trivial; it certainly smacks of NP-hard global optimisation to me...
Still, if you're talking about tens (rather than hundreds) of nodes, a fairly dumb algorithm should give good enough performance.
So, you have two constraints:
A total ordering on the classes by score; this is flexible.
Class clashes; this is not flexible.
What I mean by flexible is that you can go to more spaced out classes (with lower scores), but you cannot be in two classes at once. Interestingly, there's likely to be a positive correlation between score and clashes; higher scoring classes are more likely to clash.
My first pass at an algorithm:
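selected_classes = []
# Consider the highest-scoring classes first.
classes = sorted(classes, key=lambda c: c.score, reverse=True)
for clas in classes:
    # clashes_with is assumed to take the list of classes picked so far.
    if not clas.clashes_with(selected_classes):
        selected_classes.append(clas)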
Working out clashes might be awkward if classes are of uneven lengths, start at strange times and so on. Mapping start and end times into a simplified representation of "blocks" of time (every 15 minutes / 30 minutes or whatever you need) would make it easier to look for overlaps between the start and end of different classes.
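Here is one way the block idea could look, combined with the greedy pass above. Everything in it is an assumption for illustration: the UniClass fields, the 30-minute block size, and times stored as minutes after midnight.

```python
from dataclasses import dataclass

BLOCK_MINUTES = 30  # assumed block size


@dataclass
class UniClass:
    subject: str
    day: str
    start: int  # minutes after midnight
    end: int    # minutes after midnight, exclusive
    score: float

    def blocks(self):
        """Set of (day, block index) pairs this class occupies."""
        first = self.start // BLOCK_MINUTES
        last = (self.end - 1) // BLOCK_MINUTES
        return {(self.day, b) for b in range(first, last + 1)}

    def clashes_with(self, others):
        """True if this class shares a time block with any class in others."""
        return any(self.blocks() & other.blocks() for other in others)


def pick_classes(classes):
    """Greedy pass: take classes in descending score order, skipping clashes."""
    selected = []
    for candidate in sorted(classes, key=lambda c: c.score, reverse=True):
        if not candidate.clashes_with(selected):
            selected.append(candidate)
    return selected


timetable = [
    UniClass("Maths lecture", "Mon", 9 * 60, 10 * 60, 5.0),
    UniClass("Physics lab A", "Mon", 9 * 60 + 30, 11 * 60, 3.0),
    UniClass("Physics lab B", "Tue", 9 * 60 + 30, 11 * 60, 2.0),
    UniClass("Chemistry tutorial", "Mon", 10 * 60, 11 * 60, 1.0),
]
for chosen in pick_classes(timetable):
    print(chosen.subject, chosen.day)
```

Two classes that fall in the same block count as a clash even if they do not strictly overlap, so pick a block size no coarser than your shortest gap between classes. The sketch also ignores the "one class per subject" rule, which you would add as an extra check.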
BeautifulSoup was mentioned here a few times, e.g. in get-list-of-xml-attribute-values-in-python.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.
There are waaay too many questions here.
Please break this down into subject areas and ask specific questions on each subject. Please focus on one of these with specific questions. Please define your terms: "best" doesn't mean anything without some specific measurement to optimize.
Here's what I think I see in your list of topics.
Scraping HTML
1 Logon to the website using its Enterprise Sign On Engine login
2 Find my current semester and its related subjects (pre setup)
3 Navigate to the right page and get the data from each related subject (lecture, practical and workshop times)
4 Strip the data of useless information
Some algorithm to "rank" based on "closer to each other" looking for a "best time". Since these terms are undefined, it's nearly impossible to provide any help on this.
5 Rank the classes which are closer to each other higher, the ones on random days lower
6 Solve a best time table solution
Output something.
7 Output me a detailed list of the BEST CASE information
8 Output me a detailed list of the possible class information (some might be full for example)
Optimize something, looking for "best". Another undefinable term.
9 Get the program to select the best classes automatically
10 Keep checking to see if we can achieve 7.
BTW, Python has "lists". Whether or not they're "linked" doesn't really enter into it.