Many people want to measure code similarity to catch plagiarism; my intention, however, is to cluster a set of Python code blocks (say, answers to the same programming question) into categories, distinguishing the different approaches taken by students.
If you have any idea how this could be achieved, I would appreciate it if you share it here.
You can choose any scheme you like that essentially hashes the contents of the code blocks, and place code blocks with identical hashes into the same category.
Of course, what turns out to be similar will then depend heavily on how you define the hashing function. For instance, a truly stupid hashing function H(code)==0 will put everything in the same bin.
The hard problem is finding a hashing function that groups code blocks in a way that seems similar in a natural sense. Despite lots of research, nobody has yet found anything better to judge this than "I'll know they are similar when I see them."
You surely do not want it to be dependent on layout/indentation/whitespace/comments, or slight changes to these will classify blocks differently even if their semantic content is identical.
There are three major schemes people have commonly used to find duplicated (or similar) code:
Metrics-based schemes, which compute the hash by counting various types of operators and operands and combining the counts into a metric. (Note: this uses lexical tokens.) These often operate only at the function level. I know of no practical tools based on this.
Lexically based schemes, which break the input stream into lexemes, convert identifiers and literals into fixed special constants (i.e., treat them as undifferentiated), and then essentially hash N-grams (sequences of N tokens) over these sequences. There are many clone detectors based on essentially this idea; they work tolerably well, but also find stupid matches because nothing forces alignment with program structure boundaries.
The sequence
return ID; } void ID ( int ID ) {
is an 11-gram that occurs frequently in C-like languages but clearly isn't a useful clone. The result is that false positives tend to occur, i.e., you get claimed matches where there isn't really one.
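To illustrate the lexical scheme concretely, here is a rough Python sketch (my own illustration, not a production clone detector) that normalizes identifiers and literals to fixed placeholders and hashes token N-grams using the standard tokenize module:

# Rough sketch of lexical N-gram hashing; placeholders and N are arbitrary choices.
import io
import keyword
import tokenize

def normalized_tokens(code):
    """Lex the code, keep keywords and operators, and map identifiers and
    literals to fixed placeholders so renaming does not affect the result."""
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NAME:
            yield tok.string if keyword.iskeyword(tok.string) else "ID"
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            yield "LIT"
        elif tok.type == tokenize.OP:
            yield tok.string
        # layout tokens, comments and whitespace are simply dropped

def ngram_hashes(code, n=4):
    toks = list(normalized_tokens(code))
    return {hash(tuple(toks[i:i + n])) for i in range(max(0, len(toks) - n + 1))}

def similarity(code_a, code_b):
    """Jaccard overlap of the two blocks' N-gram hash sets."""
    a, b = ngram_hashes(code_a), ngram_hashes(code_b)
    return len(a & b) / max(1, len(a | b))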
Abstract syntax tree based matching (hashing over subtrees), which automatically aligns clones to language boundaries by virtue of using the ASTs, which represent the language structures directly. (I'm the author of the original paper on this, and built a commercial product, CloneDR, based on the idea; see my bio.) These tools have the advantage that they can match code containing token sequences of different lengths in the middle of a match, e.g., one statement (of arbitrary size) is replaced by another.
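By way of illustration only (this is not CloneDR's algorithm), a toy version of subtree hashing in Python might look like this, grouping subtrees by their shape while ignoring names and literal values:

# Toy sketch of AST subtree hashing; min_nodes is an arbitrary cutoff.
import ast
from collections import defaultdict

def shape(node):
    """A structural fingerprint of a subtree that ignores names and literal values."""
    return (type(node).__name__,
            tuple(shape(child) for child in ast.iter_child_nodes(node)))

def subtree_buckets(code, min_nodes=8):
    """Group non-trivial subtrees of a code block by the hash of their shape."""
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(code)):
        if sum(1 for _ in ast.walk(node)) >= min_nodes:
            buckets[hash(shape(node))].append(node)
    return buckets

# Two submissions sharing many non-trivial buckets likely take the same approach.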
This paper provides a survey of the various techniques: http://www.cs.usask.ca/~croy/papers/2009/RCK_SCP_Clones.pdf. It shows that AST-based clone detection tools appear to be the most effective at producing clones that people agree are similar blocks of code, which seems key to OP's particular interest; see Table 14.
[There are graph-based schemes that match control and data flow graphs. They should arguably produce even better matches but apparently do not do much better in practice.]
One approach would be to count the number of functions, objects, and keywords, possibly grouped into categories such as branching, creating, manipulating, etc., and the number of variables of each type, without relying on the methods and variables having the same names.
For a given problem, similar approaches will tend to produce similar scores for these, e.g. a student who used a decision tree would have a high number of branch statements, while one who used a decision table would have far fewer.
This approach would be much quicker to implement than parsing the code structure and comparing the results.
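A minimal sketch of that counting idea, assuming you only care about keyword categories (the groupings below are my own choice, not a standard taxonomy):

# Count keywords per category using the tokenizer; no parsing required.
import io
import keyword
import tokenize
from collections import Counter

CATEGORIES = {
    "branching": {"if", "elif", "else"},
    "looping": {"for", "while"},
    "definitions": {"def", "class", "lambda"},
    "exceptions": {"try", "except", "finally", "raise"},
}

def count_features(code):
    counts = Counter()
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            for category, words in CATEGORIES.items():
                if tok.string in words:
                    counts[category] += 1
    return counts

Submissions with similar count vectors (e.g. many "branching" tokens versus few) can then be clustered together, independent of the names they use.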
I have a database table containing well over a million strings. Each string is a term that can vary in length from two words to five or six.
["big giant cars", "zebra videos", "hotels in rio de janeiro".......]
I also have a blacklist of several thousand smaller terms in a csv file. What I want to do is identify terms in the database that are similar to the blacklisted terms in my csv file. Similarity in this case can be construed as mis-spellings of the blacklisted terms.
I am familiar with libraries in Python such as fuzzywuzzy that can assess string similarity using Levenshtein distance and return an integer representation of the similarity. An example from this tutorial would be:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
A downside with this approach would be that it may falsely identify terms that may mean something in a different context.
A simple example of this would be "big butt", a blacklisted string, being confused with a more innocent string like "big but".
My question is: is it possible to accomplish this programmatically in Python, or would it be easier to just retrieve all the similar-looking keywords and filter out false positives?
I'm not sure there's any definitive answer to this problem, so the best I can do is to explain how I'd approach this problem, and hopefully you'll be able to get any ideas from my ramblings. :-)
First.
Fuzzy string matching on the raw text might not be enough. People are going to use similar-looking characters and non-character symbols to get around any text matching, to the point where there's nearly zero match between a blacklisted word and the actual text, and yet it's still readable for what it is. So you will probably need some normalization of your dictionary and search text, like converting all '0' (zeroes) to 'O' (capital O), '><' to 'X', etc. I believe there are libraries and/or conversion references for that purpose. Non-Latin symbols are also a distinct possibility and should be accounted for.
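For example, a minimal normalization sketch (the substitution table is illustrative, not exhaustive):

# Crude normalization; extend the table for whatever substitutions you actually see.
SUBSTITUTIONS = {
    "><": "x",
    "0": "o",
    "1": "i",
    "3": "e",
    "@": "a",
    "$": "s",
}

def normalize(term):
    term = term.lower()
    for src, dst in SUBSTITUTIONS.items():
        term = term.replace(src, dst)
    return term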
Second.
I don't think you'll be able to differentiate between blacklisted words and similar-looking legal variants in a single pass. So yes, most likely you will have to search for possible blacklisted matches and then check whether what you found matches some legal words too. Which means you will need not only the blacklisted dictionary, but a whitelisted dictionary as well. On a more positive note, there's probably no need to normalize the whitelisted dictionary, as people writing acceptable text will probably write it in acceptable language without any of the tricks outlined above. Or you could normalize it too if you're feeling paranoid. :-)
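A minimal sketch of that two-list check with fuzzywuzzy (the threshold and return labels are placeholders; normalize() is the sketch from the previous step):

from fuzzywuzzy import fuzz

def classify_term(term, blacklist, whitelist, threshold=90):
    term = normalize(term)
    black_score = max(fuzz.ratio(term, b) for b in blacklist)
    white_score = max(fuzz.ratio(term, w) for w in whitelist)
    if black_score < threshold:
        return "ok"
    if white_score >= black_score:
        return "review"   # close to both lists: leave it for a human (see below)
    return "flag"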
Third.
However, the problem is that matching words/expressions against black and white lists doesn't actually give you a reliable answer. Using your example, a person might write "big butt" as an honest typo which will be obvious in context (or vice versa, write "big but" intentionally to get a higher match against a whitelisted word, even if the context makes the real meaning quite obvious). So you might have to actually check the context when there are good enough matches against both black and white lists. This is an area I'm not intimately familiar with. It's probably possible to build correlation maps for various words (from both dictionaries) to identify which words are more (or less) frequently used in combination with them, and use them to check your specific example. Using this very paragraph as an example, the word "black" could be whitelisted if it's used together with "list" but blacklisted in some other situations.
Fourth.
Even applying all those measures together, you might want to leave a certain amount of gray area. That is, unless there's high enough certainty in either direction, leave the final decision to a human (screening comments/posts for a time, automatically putting them into a moderation queue, or whatever else your project dictates).
Fifth.
You might try to dabble in learning algorithms, collecting human input from previous step and using it to automatically fine-tune your algorithm as time goes by.
Hope that helps. :-)
I’m trying to figure out which direction to take my Python NLP project in, and I’d be very grateful to the SO community for any advice.
Problem:
Let’s say I have 100 .txt files that contain the minutes of 100 meetings held by a decision-making body. I also have 100 .txt files of corresponding meeting outcomes, which contain the resolutions passed by this body. The outcomes fall into one of seven categories: 1 – take no action, 2 – take soft action, 3 – take stronger action, 4 – take strongest action, 5 – cancel soft action previously taken, 6 – cancel stronger action previously taken, 7 – cancel strongest action previously taken. Alternatively, this can be presented on a scale from -3 to +3, with 0 signifying no action, +1 signifying soft action, -1 signifying cancellation of soft action previously taken, and so on.
Based on the text of the inputs, I’m interested in predicting which of these seven outcomes will occur.
I’m thinking of treating this as a form of sentiment analysis, since the decision to take a certain kind of action is basically a sentiment. However, all the sentiment analysis examples I’ve found have focused on positive/negative dichotomies, sometimes adding in neutral sentiment as a category. I haven’t found any examples with more than 3 possible classifications of outcomes – not sure whether this is because I haven’t looked in the right places, because it just isn’t really an approach of interest for whatever reason, or because this approach is a silly idea for some reason of which I’m not yet quite sure.
Question 1. Should I be approaching this as a form of sentiment analysis, or is there some other approach that would work better? Should I instead treat this as a kind of categorization problem, similar to classifying news articles by topic and training the model to recognize the "topic" (outcome)?
Corpus:
I understand that I will need to build a corpus for training/test data, and it looks like I have two immediately evident options:
1 – hand-code a CSV file for training data that would contain some key phrases from each input text and list the value of the corresponding outcome on a 7-point scale, similar to what’s been done here: http://help.sentiment140.com/for-students
2 – use the approach Pang and Lee used (http://www.cs.cornell.edu/people/pabo/movie-review-data/) and put each of my .txt files of inputs into one of seven folders based on outcomes, since the outcomes (what kind of action was taken) are known based on historical data.
The downside to the first option is that it would be very subjective – I would determine which keywords/phrases I think are the most important to include, and I may not necessarily be the best arbiter. The downside to the second option is that it might have less predictive power because the texts are pretty long, contain lots of extraneous words/phrases, and are often stylistically similar (policy speeches tend to use policy words). I looked at Pang and Lee’s data, though, and it seems like that may not be a huge problem, since the reviews they’re using are also not very varied in terms of style. I’m leaning towards the Pang and Lee approach, but I’m not sure if it would even work with more than two types of outcomes.
Question 2. Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?
Question 3. Given all of the above, which classifier should I be using? I’m thinking maximum entropy would work best; I’ve also looked into random forests, but I have no experience with the latter and really have no idea what I’m doing (yet) when it comes to them.
Thank you very much in advance :)
Question 1 - The most straightforward way to think of this is as a text classification task (sentiment analysis is one kind of text classification task, but by no means the only one).
Alternatively, as you point out, you could consider your data as existing on a continuum ranging from -3 (cancel strongest action previously taken) to +3 (take strongest action), with 0 (take no action) in the middle. In this case you could treat the outcome as a continuous variable with a natural ordering. If so, then you could treat this as a regression problem rather than a classification problem. It's hard to know whether this is a sensible thing to do without knowing more about the data. If you suspect you will have a number of words/phrases that will be very probable at one end of the scale (-3) and very improbable at the other (+3), or vice versa, then regression may make sense. On the other hand, if the relevant words/phrases are associated with strong emotion and are likely to appear at either end of the scale but not in the middle, then you may be better off treating it as classification. It also depends on how you want to evaluate your results. If your algorithm predicts that a document is a -2 and it's actually a -3, will it be penalized less than if it had predicted +3? If so, it might be better to treat this as a regression task.
Question 2. "Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?"
Note that the set of documents (the .txt files of meeting minutes and corresponding outcomes) is your corpus -- the typical thing to do is randomly select 20% or so to be set aside as test data and use the remaining 80% as training data. The two general options you consider above are options for selecting the set of features that your classification or regression algorithm should attend to.
You correctly identify the upsides and downsides of the two most obvious approaches for coming up with features (hand-picking your own vs. Pang & Lee's approach of just using unigrams (words) as features).
Personally I'd also lean towards this latter approach, given that it's notoriously hard for humans to predict which phrases will be useful for classification--although there's no reason why you couldn't combine the two, having your initial set of features include all words plus whatever phrases you think might be particularly relevant. As you point out, there will be a lot of extraneous words, so it may help to throw out words that are very infrequent, or that don't differ enough in frequency between classes to provide any discriminative power. Approaches for reducing an initial set of features are known as "feature selection" techniques - one common method is mentioned here. Or see this paper for a more comprehensive list.
You could also consider features like the percent of high-valence words, high-arousal words, or high-dominance words, using the dataset here (click Supplementary Material and download the zip).
Depending on how much effort you want to put into this project, another common thing to do is to try a whole bunch of approaches and see which works best. Of course, you can't test which approach works best using data in the test set--that would be cheating and would run the risk of overfitting to the test data. But you can set aside a small part of your training set as 'validation data' (i.e. a mini-test set that you use for testing different approaches). Given that you don't have that much training data (80 documents or so), you could consider using cross validation.
Question 3 - The best way is probably to try different approaches and pick whatever works best in cross-validation. But if I had to pick one or two, I personally have found that k-nearest neighbor classification (with low k) or SVMs often work well for this kind of thing. A reasonable approach might be:
1) having your initial features be all unigrams (words) plus any phrases that you think might be predictive after you look at some training data;
2) applying a feature selection technique to trim down your feature set;
3) applying any algorithm that can deal with high-dimensional/text features, such as those in http://www.csc.kth.se/utbildning/kth/kurser/DD2475/ir10/forelasningar/Lecture9_4.pdf (lots of good tips in that pdf!), or those that achieved decent performance in the Pang & Lee paper.
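A rough scikit-learn sketch of those three steps, evaluated with the cross-validation mentioned above (my own illustration; texts, labels, and all parameter values are assumptions/placeholders, not tuned recommendations):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# texts: list of meeting-minute strings; labels: the corresponding outcomes (1..7 or -3..+3)
model = make_pipeline(
    TfidfVectorizer(min_df=2, stop_words="english"),   # step 1: unigram features
    SelectKBest(chi2, k=500),                          # step 2: feature selection
    LinearSVC(),                                       # step 3: high-dimensional-friendly classifier
)
print(cross_val_score(model, texts, labels, cv=5).mean())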
Other possibilities are discussed in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf . Often the specific algorithm matters less than the features that go into it. Frankly it sounds like a very difficult sort of classification task, so it's possible that nothing will work very well.
If you decide to treat it as a regression rather than a classification task, you could go with k nearest neighbors regression ( http://www.saedsayad.com/k_nearest_neighbors_reg.htm ) or ridge regression.
Random forests often do not work well with large numbers of dependent features (words), though they may work well if you end up deciding to go with a smaller number of features (for example, a set of words/phrases you manually select, plus % of high-valence words and % of high-arousal words).
I am a newbie in Python and have been trying my hand at different problems that introduce me to different modules and functionalities (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups on Facebook that use it as a medium to broadcast information.
I want to group these posts together when they are content-wise the same.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar as they both ask to go the group's website and register.
P.S: Just a clarification, if any one of the links would have been abc.com, they wouldn't have been similar.
Priority is to the source and then to the action (action being registering here).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is, can Python make this job easier for me somehow? Some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of urls contained within, I would do it like this:
from collections import defaultdict

groups = defaultdict(list)
for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
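If you don't have geturls yet, a crude placeholder could be (the regex is illustrative only, not a robust URL extractor):

import re

URL_RE = re.compile(r"(?:https?://)?(?:www\.)?([a-z0-9-]+(?:\.[a-z]{2,})+)", re.IGNORECASE)

def geturls(post):
    """Return the domains mentioned in a post, e.g. ['xyz.com']."""
    return [match.lower() for match in URL_RE.findall(post)]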
That greatly depends on your definition of being "content-wise same". A straightforward approach is to use a so-called Term Frequency - Inverse Document Frequency (TFIDF) model.
Simply put, make a long list of all words in all your posts, filter out stop-words (articles, determiners, etc.), and for each document (= post) count how often each term occurs, multiplying that count by the importance of the term (its inverse document frequency, calculated as the log of the ratio of the total number of documents to the number of documents containing that term). This way, words which are very rare will be more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will be significantly high, so similar documents might be the ones where the same term achieves the highest score (i.e., the highest component of the document vectors is the same), or maybe ones where the Euclidean distance between the three highest components is below some parameter. That sounds very complicated, but (of course) there's a module for that.
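For instance, scikit-learn will do the TFIDF bookkeeping for you; a minimal sketch using the two example posts (the similarity cut-off for deciding "same content" is still up to you):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "xyz.com is selling free domains. Go register at xyz.com",
    "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost.",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)
print(cosine_similarity(tfidf))   # pairwise similarity matrix; compare entries to a chosen cut-off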
Even after years of programming, I'm ashamed to say that I've never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it's a technique that I find myself using increasingly often.
So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.
To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real-world. It merely has to correctly match or fail to match a pattern in a string.
The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.
EDIT: I should clarify that while I'm going to implement the regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand enough of it to make it worth my while.
Writing an implementation of a regular expression engine is indeed a quite complex task.
But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:
Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.
I've already given a +1 to Mark Byers - but as far as I remember the paper doesn't really say that much about how regular expression matching works beyond explaining why one algorithm is bad and another much better. Maybe something in the links?
I'll focus on the good approach - creating finite automata. If you limit yourself to deterministic automata with no minimisation, this isn't really too difficult.
What I'll (very quickly) describe is the approach taken in Modern Compiler Design.
Imagine you have the following regular expression...
a (b c)* d
The letters represent literal characters to match. The * is the usual zero-or-more repetitions match. The basic idea is to derive states based on dotted rules. State zero we'll take as the state where nothing has been matched yet, so the dot goes at the front...
0 : .a (b c)* d
The only possible match is 'a', so the next state we derive is...
1 : a.(b c)* d
We now have two possibilities - match the 'b' (if there's at least one repeat of 'b c') or match the 'd' otherwise. Note - we are basically doing a digraph search here (either depth first or breadth first or whatever) but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we'll need to queue one of our cases for later consideration, but I'll ignore that issue from here on. Anyway, we've discovered two new states...
2 : a (b.c)* d
3 : a (b c)* d.
State 3 is an end state (there may be more than one). For state 2, we can only match the 'c', but we need to be careful with the dot position afterwards. We get "a.(b c)* d" - which is the same as state 1, so we don't need a new state.
IIRC, the approach in Modern Compiler Design is to translate a rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to...
1 : a.b c (b c)* d
a.d
That is, your next option is either to match the first repetition or to skip the repetition. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all your past matches (everything before the '.') as you only care about future matches. This typically gives a smaller state model (but not necessarily a minimal one).
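To make the derivation concrete, here is a tiny table-driven sketch in Python (my own illustration; spaces in the pattern are just separators) of the automaton those dotted-rule states describe:

# DFA for  a (b c)* d , using the same state numbers as the dotted rules above.
DFA = {
    0: {"a": 1},            # 0 : .a (b c)* d
    1: {"b": 2, "d": 3},    # 1 : a.(b c)* d
    2: {"c": 1},            # 2 : a (b.c)* d
}
ACCEPTING = {3}             # 3 : a (b c)* d.

def matches(s):
    state = 0
    for ch in s:
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return False
    return state in ACCEPTING

assert matches("ad") and matches("abcbcd") and not matches("abd")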
EDIT If you do discard already matched details, your state description is a representation of the set of strings that can occur from this point on.
In terms of abstract algebra, this is a kind of set closure. An algebra is basically a set with one (or more) operators. Our set is of state descriptions, and our operators are our transitions (character matches). A closed set is one where applying any operator to any member of the set always produces another member that is in the set. The closure of a set is the minimal larger set that is closed. So basically, starting with the obvious start state, we are constructing the minimal set of states that is closed relative to our set of transition operators - the minimal set of reachable states.
Minimal here refers to the closure process - there may be a smaller equivalent automata which is normally referred to as minimal.
With this basic idea in mind, it's not too difficult to say "if I have two state machines representing two sets of strings, how do I derive a third representing the union" (or intersection, or set difference...). Instead of dotted rules, your state representations will be a current state (or set of current states) from each input automaton, and perhaps additional details.
If your regular grammars are getting complex, you can minimise. The basic idea here is relatively simple. You group all your states into one equivalence class or "block". Then you repeatedly test whether you need to split blocks (the states aren't really equivalent) with respect to a particular transition type. If all states in a particular block can accept a match of the same character and, in doing so, reach the same next-block, they are equivalent.
Hopcroft's algorithm is an efficient way to handle this basic idea.
A particularly interesting thing about minimisation is that every deterministic finite automaton has precisely one minimal form. Furthermore, Hopcroft's algorithm will produce the same representation of that minimal form, no matter what representation of what larger case it started from. That is, this is a "canonical" representation which can be used to derive a hash or for arbitrary-but-consistent orderings. What this means is that you can use minimal automata as keys into containers.
The above is probably a bit sloppy WRT definitions, so make sure you look up any terms yourself before using them yourself, but with a bit of luck this gives a fair quick introduction to the basic ideas.
BTW - have a look around the rest of Dick Grune's site - he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but as you'll see, there's a second edition imminent.
"A play on regular expressions: functional pearl" takes an interesting approach. The implementation is given in Haskell, but it's been reimplemented in Python at least once.
The developed program is based on an old technique to turn regular expressions into finite automata which makes it efficient both in terms of worst-case time and space bounds and actual performance: despite its simplicity, the Haskell implementation can compete with a recently published professional C++ program for the same problem.
There's an interesting (if slightly short) chapter in Beautiful Code by Brian Kernighan, appropriately called "A Regular Expression Matcher". In it he discusses a simple matcher that can match literal characters, and the .^$* symbols.
I do agree that writing a regex engine will improve understanding, but have you taken a look at ANTLR? It generates parsers automatically for any kind of language. So maybe you can try your hand by taking one of the language grammars listed at Grammar examples and running through the AST and parser that it generates. It generates fairly complicated code, but you will get a good understanding of how a parser works.
I have what I think is a simple machine learning question.
Here is the basic problem: I am repeatedly given a new object and a list of descriptions about the object. For example: new_object: 'bob', new_object_descriptions: ['tall','old','funny']. I then have to use some kind of machine learning to find previously handled objects that have the most similar descriptions (10 or fewer of them), for example, past_similar_objects: ['frank','steve','joe']. Next, I have an algorithm that can directly measure whether these objects are indeed similar to bob, for example, correct_objects: ['steve','joe']. The classifier is then given this feedback training of successful matches. Then this loop repeats with a new object.
Here's the pseudo-code:
Classifier = new_classifier()
while True:
    new_object, new_object_descriptions = get_new_object_and_descriptions()
    past_similar_objects = Classifier.classify(new_object, new_object_descriptions)
    correct_objects = calc_successful_matches(new_object, past_similar_objects)
    Classifier.train_successful_matches(new_object, correct_objects)
But, there are some stipulations that may limit what classifier can be used:
There will be millions of objects put into this classifier so classification and training needs to scale well to millions of object types and still be fast. I believe this disqualifies something like a spam classifier that is optimal for just two types: spam or not spam. (Update: I could probably narrow this to thousands of objects instead of millions, if that is a problem.)
Again, I prefer speed when millions of objects are being classified, over accuracy.
Update: The classifier should return the 10 (or fewer) most similar objects, based on feedback from past training. Without this limit, an obvious cheat would be for the classifier could just return all past objects :)
What are decent, fast machine learning algorithms for this purpose?
Note: The calc_successful_matches distance metric is extremely expensive to calculate and that's why I'm using a fast machine learning algorithm to try to guess which objects will be close before I actually do the expensive calculation.
An algorithm that seems to meet your requirements (and is perhaps similar to what John the Statistician is suggesting) is Semantic Hashing. The basic idea is that it trains a deep belief network (a type of neural network that some have called 'neural networks 2.0' and is a very active area of research right now) to hash the list of descriptions of an object into a binary number, such that the Hamming distance between numbers corresponds to the similarity of the objects. Since this just requires bitwise operations it can be pretty fast, and since you can use it to create a nearest-neighbor-style algorithm, it naturally generalizes to a very large number of classes. This is very good state-of-the-art stuff. Downside: it's not trivial to understand and implement, and requires some parameter tuning. The author provides some Matlab code here. A somewhat easier algorithm to implement, and closely related to this one, is Locality Sensitive Hashing.
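To give a flavour of the LSH idea, here is a minimal random-hyperplane sketch (my own illustration, not the Semantic Hashing network from the paper); objects whose bit signatures agree on most bits tend to have high cosine similarity between their description vectors:

import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_BITS = 1000, 32                 # placeholder sizes
planes = rng.standard_normal((N_BITS, N_FEATURES))

def signature(vec):
    """One bit per random hyperplane: which side of the plane the vector falls on."""
    return (planes @ vec > 0).astype(np.uint8)

def hamming(sig_a, sig_b):
    return int(np.count_nonzero(sig_a != sig_b))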
Now that you say that you have an expensive distance function you want to approximate quickly, I'm reminded of another very interesting algorithm that does this, Boostmap. This one uses boosting to create a fast metric which approximates an expensive-to-calculate metric. In a certain sense it's similar to the above idea, but the algorithms used are different. The authors of this paper have several papers on related techniques, all pretty good quality (published in top conferences) that you might want to check out.
Do you really need a machine learning algorithm for this? What is your metric for similarity? You've mentioned the dimensionality in terms of the number of objects, but what about the size of the trait set for each person? Is there a maximum number of trait types? I might try something like this:
1) Have a dictionary named map mapping each trait to a list of names:
for each person p:
    for each trait t in p:
        map[t].add(p)
2) Then, when I want to find the closest person, I'd take my dictionary and create a new temp one:
dictionary named cnt mapping each name to a count
for each trait t in my person of interest:
    for each person p in map[t]:
        cnt[p]++
Then the entry with the highest count is the closest.
The benefit here is that the map is only created once. If the number of traits per person is small, and the number of available trait types is large, then the algorithm should be fast.
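In Python, that pseudo-code might look roughly like this (a sketch; the names are illustrative):

from collections import defaultdict, Counter

trait_to_people = defaultdict(set)

def add_person(name, traits):
    for t in traits:
        trait_to_people[t].add(name)

def closest(traits, k=10):
    cnt = Counter()
    for t in traits:
        cnt.update(trait_to_people[t])
    return [name for name, _ in cnt.most_common(k)]

add_person("frank", ["young", "short", "funny"])
add_person("steve", ["tall", "old", "grumpy"])
add_person("joe", ["tall", "old"])
print(closest(["tall", "old", "funny"]))   # e.g. ['steve', 'joe', 'frank']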
You could use the vector space model (http://en.wikipedia.org/wiki/Vector_space_model). I think what you are trying to learn is how to weight terms in considering how close two object description vectors are to each other, say for example in terms of a simplified mutual information. This could be very efficient as you could hash from terms to vectors, which means you wouldn't have to compare objects without shared features. The naive model would then have an adjustable weight per term (this could either be per term per vector, per term overall, or both), as well as a threshold. The vector space model is a widely used technique (for example, in Apache Lucene, which you might be able to use for this problem), so you'll be able to find out a lot about it through further searches.
Let me give a very simple formulation of this in terms of your example. Given bob: ['tall','old','funny'], I retrieve
frank: ['young', 'short', 'funny']
steve: ['tall','old','grumpy']
joe: ['tall','old']
as I am maintaining a hash from funny->{frank,...}, tall->{steve, joe,...}, and old->{steve, joe,...}
I calculate something like the overall mutual information: weight of shared tags/weight of bob's tags. If that weight is over the threshold, I include them in the list.
When training, if I make a mistake I modify the shared tags. If my error was including frank, I reduce the weight for funny, while if I make a mistake by not including Steve or Joe, I increase the weight for tall and old.
You can make this as sophisticated as you'd like, for example by including weights for conjunctions of terms.
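A rough sketch of that weighted-overlap idea with a threshold and a crude feedback step (the names, threshold, and update factors are my own illustration):

from collections import defaultdict

index = defaultdict(set)                # tag -> objects having that tag
weights = defaultdict(lambda: 1.0)      # adjustable weight per tag
THRESHOLD = 0.5

def add(name, tags):
    for t in tags:
        index[t].add(name)

def retrieve(tags):
    scores = defaultdict(float)
    total = sum(weights[t] for t in tags) or 1.0
    for t in tags:
        for obj in index[t]:
            scores[obj] += weights[t]
    # "weight of shared tags / weight of the query's tags" over the threshold
    return {obj for obj, s in scores.items() if s / total >= THRESHOLD}

def train(tags, returned, correct):
    for t in tags:
        for obj in returned - correct:   # false positives: reduce weights of shared tags
            if obj in index[t]:
                weights[t] *= 0.9
        for obj in correct - returned:   # misses: increase weights of shared tags
            if obj in index[t]:
                weights[t] *= 1.1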
SVM is pretty fast. LIBSVM for Python, in particular, provides a very decent implementation of Support Vector Machine for classification.
This project departs from typical classification applications in two notable ways:
Rather than outputting the class which the new object is thought to belong to (or possibly outputting an array of these classes, each with probability / confidence level), the "classifier" provides a list of "neighbors" which are "close enough" to the new object.
With each new classification, an objective function, independent from the classifier, provides the list of the correct "neighbors"; in turn the corrected list (a subset of the list provided by the classifier ?) is then used to train the classifier
The idea behind the second point is probably that future objects submitted to the classifier which are similar to the current object should get better "classified" (be associated with a more correct set of previously seen objects), since the ongoing training reinforces connections to positive (correct) matches, while weakening the connections to objects which the classifier initially got wrong.
These two characteristics introduce distinct problems.
- The fact that the output is a list of objects rather than a "prototype" (or category identifier of sorts) makes it difficult to scale as the number of objects seen so far grows toward the millions of instances suggested in the question.
- The fact that the training is done on the basis of a subset of the matches found by the classifier, may introduce over-fitting, whereby the classifier could become "blind" to features (dimensions) which it, accidentally, didn't weight as important/relevant, in the early parts of the training. (I may be assuming too much with regards to the objective function in charge of producing the list of "correct" objects)
Possibly, the scaling concern could be handled by having a two-step process, with a first classifier, based on the K-Means algorithm or something similar, which would produce a subset of the overall object collection (of objects previously seen) as plausible matches for the current object (effectively filtering out, say, 70% or more of the collection). These possible matches would then be evaluated on the basis of the Vector Space Model (particularly relevant if the feature dimensions are based on factors rather than values) or some other model. The underlying assumption for this two-step process is that the object collection will effectively expose clusters (it may instead just be relatively evenly distributed along the various dimensions).
Another way to further limit the number of candidates to evaluate, as the size of the previously seen objects grows, is to remove near duplicates and to only compare with one of these (but to supply the full duplicate list in the result, assuming that if the new object is close to the "representative" of this near duplicate class, all members of the class would also match)
The issue of over-fitting is trickier to handle. A possible approach would be to [sometimes] randomly add objects to the matching list which the classifier would not normally include. The extra objects could be added on the basis of their relative distance to the new object (i.e. making it a bit more probable that a relatively close object gets added).
What you describe is somewhat similar to the Locally Weighted Learning algorithm, which, given a query instance, trains a model locally on the neighboring instances, weighted by their distances to the query instance.
Weka (Java) has an implementation of this in weka.classifiers.lazy.LWL