Real world typo statistics? [closed] - python

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Where can I find some real world typo statistics?
I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.
There are 2 kinds of mistakes:
typos - "Helllo" instead of "Hello" / "Satudray" instead of "Saturday" etc.
Spelling - "Shikago" instead of "Chicago"
I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).
I want to focus on the Damerau-Levenshtein (or simply edit-distance). The textbook implementations always use '1' for the weight of deletions, insertions, substitutions, and transpositions. While this is simple and allows for nice algorithms, it doesn't match "reality" / "real-world probabilities".
Examples:
I'm sure the likelihood of "Helllo" ("Hello") is greater than "Helzlo", yet they are both 1 edit distance away.
"Gello" is closer than "Qello" to "Hello" on a QWERTY keyboard.
Unicode transliterations: What is the "real" distance between "München" and "Munchen"?
What should the "real world" weights be for deletions, insertions, substitutions, and transpositions?
Even Norvig's very cool spell corrector uses non-weighted edit distance.
BTW, I'm sure the weights need to be functions and not simple floats (per the above examples)...
I can adjust the algorithm, but where can I "learn" these weights? I don't have access to Google-scale data...
Should I just guess them?
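To make the "weights as functions" idea concrete, here is a rough sketch of what I have in mind: a plain Levenshtein DP where the substitution cost depends on keyboard adjacency. The adjacency table is a tiny made-up illustration (not a real layout map), and transpositions are left out for brevity:

# Sketch of a weighted edit distance where the substitution cost is a
# function of the two characters: cheaper if they are QWERTY neighbours.
# The adjacency table is illustrative and incomplete.
QWERTY_NEIGHBORS = {
    'h': set('gyujnb'),
    'g': set('ftyhvb'),
    'q': set('wa'),
}

def sub_cost(a, b):
    if a == b:
        return 0.0
    if b in QWERTY_NEIGHBORS.get(a, set()) or a in QWERTY_NEIGHBORS.get(b, set()):
        return 0.5   # adjacent keys: a "cheap" typo
    return 1.0       # anything else: full price

def weighted_edit_distance(s, t, del_cost=1.0, ins_cost=1.0):
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + del_cost,                          # deletion
                          d[i][j - 1] + ins_cost,                          # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))  # substitution
    return d[m][n]

print(weighted_edit_distance("hello", "gello"))  # 0.5 -- g and h are neighbours
print(weighted_edit_distance("hello", "qello"))  # 1.0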
EDIT - trying to answer user questions:
My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance).
I'm developing an NLP Travel Search engine, so my dictionary contains ~25K destinations (expected to grow to 100K), Time Expressions ~200 (expected 1K), People expressions ~100 (expected 300), Money Expressions ~100 (expected 500), "glue logic words" ("from", "beautiful", "apartment") ~2K (expected 10K) and so on...
Usage of the edit distance is different for each of the above word-groups. I try to "auto-correct when obvious", e.g. 1 edit distance away from only 1 other word in the dictionary. I have many other hand-tuned rules, e.g. Double Metaphone fix which is not more than 2 edit distance away from a dictionary word with a length > 4... The list of rules continues to grow as I learn from real world input.
"How many pairs of dictionary entries are within your threshold?": well, that depends on the "fancy weighting system" and on real world (future) input, doesn't it? Anyway, I have extensive unit tests so that every change I make to the system only makes it better (based on past inputs, of course). Most sub-6 letter words are within 1 edit distance from a word that is 1 edit distance away from another dictionary entry.
Today when there are 2 dictionary entries at the same distance from the input I try to apply various statistics to better guess which the user meant (e.g. Paris, France is more likely to show up in my search than Pārīz, Iran).
The cost of choosing a wrong word is returning semi-random (often ridiculous) results to the end-user and potentially losing a customer. The cost of not understanding is slightly less expensive: the user will be asked to rephrase.
Is the cost of complexity worth it? Yes, I'm sure it is. You would not believe the amount of typos people throw at the system and expect it to understand, and I could sure use the boost in Precision and Recall.

A possible source of real-world typo statistics would be Wikipedia's complete edit history.
http://download.wikimedia.org/
Also, you might be interested in AWB's RegExTypoFix:
http://en.wikipedia.org/wiki/Wikipedia:AWB/T

I would advise you to check out the trigram algorithm. In my opinion it works better for finding typos than an edit distance algorithm. It should work faster as well, and if you keep the dictionary in a Postgres database you can make use of an index.
You may also find the Stack Overflow topic about Google's "Did you mean?" useful.
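For reference, a minimal sketch of the trigram idea (roughly what the pg_trgm extension does in Postgres): pad the strings, collect 3-grams, and score by set overlap. The padding and the Jaccard score are simplifications of what the extension actually computes:

# Minimal trigram similarity sketch: pad the strings, collect 3-grams,
# and score by Jaccard overlap of the 3-gram sets.
def trigrams(s):
    s = "  " + s.lower() + " "   # padding, similar in spirit to pg_trgm
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / (len(ta | tb) or 1)

print(trigram_similarity("Saturday", "Satudray"))  # ~0.38 despite the transposition
print(trigram_similarity("Saturday", "Monday"))    # ~0.14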

Probability Scoring for Spelling Correction by Church and Gale might help. In that paper, the authors model typos as a noisy channel between the author and the computer. The appendix has tables for typos seen in a corpus of Associated Press publications. There is a table for each of the following kinds of typos:
deletion
insertion
substitution
transposition
For example, examining the insertion table, we can see that l was incorrectly inserted after l 128 times (the highest number in that column). Using these tables, you can generate the probabilities you're looking for.
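As an illustration of how such a table could be turned into edit-operation weights, here is a hedged sketch using negative log probabilities. The counts and totals below are invented placeholders, not the actual figures from the paper:

import math

# ins_count[x][y]: how often y was wrongly inserted after x; total_after[x]:
# how often x appeared. These numbers are invented placeholders -- the real
# ones come from the tables in the paper's appendix.
ins_count = {"l": {"l": 128, "z": 1}}
total_after = {"l": 5000}

def insertion_cost(prev_char, inserted_char):
    # Turn a count into a cost: rarer errors get a higher (negative log) cost.
    count = ins_count.get(prev_char, {}).get(inserted_char, 0) + 1   # add-one smoothing
    prob = count / (total_after.get(prev_char, 1) + 26)
    return -math.log(prob)

print(insertion_cost("l", "l"))  # cheap: "helllo" is a plausible typo for "hello"
print(insertion_cost("l", "z"))  # much more expensive: "helzlo" is not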

If research is your interest, I think continuing with that algorithm and trying to find decent weights would be fruitful.
I can't help you with typo stats, but I think you should also play with Python's difflib. Specifically, the ratio() method of SequenceMatcher. It uses an algorithm which the docs (http://docs.python.org/library/difflib.html) claim is well suited to matches that 'look right', and it may be useful to augment or test what you're doing.
For Python programmers just looking for typos, it is a good place to start. One of my coworkers has used both Levenshtein edit distance and SequenceMatcher's ratio() and got much better results from ratio().
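For instance, a quick sketch of ratio() and its convenience wrapper get_close_matches() against a small destination list (the names are made-up sample data):

import difflib

destinations = ["Chicago", "Shanghai", "Santiago", "San Diego"]

# Similarity score between a misspelling and one candidate.
print(difflib.SequenceMatcher(None, "Shikago", "Chicago").ratio())   # ~0.71

# get_close_matches() ranks dictionary entries by the same ratio() score.
print(difflib.get_close_matches("Shikago", destinations, n=3, cutoff=0.6))
# ['Chicago', 'Santiago']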

Some questions for you, to help you determine whether you should be asking your "where do I find real-world weights" question:
Have you actually measured the effectiveness of the uniform weighting implementation? How?
How many different "internal objects" do you have -- i.e. what is the size of your dictionary?
How are you actually using the edit distance e.g. John/Joan, Marmaduke/Marmeduke, Featherstonehaugh/Featherstonhaugh: is that "all 1 error" or is it 25% / 11.1% / 5.9% difference? What threshold are you using?
How many pairs of dictionary entries are within your threshold (e.g. John vs Joan, Joan vs Juan, etc)? If you introduced a fancy weighting system, how many pairs of dictionary entries would migrate (a) from inside the threshold to outside (b) vice versa?
What do you do if both John and Juan are in your dictionary and the user types Joan?
What are the penalties/costs of (1) choosing the wrong dictionary word (not the one that the user meant) (2) failing to recognise the user's input?
Will introducing a complicated weighting system actually reduce the probabilities of the above two error types by sufficient margin to make the complication and slower speed worthwhile?
BTW, how do you know what keyboard the user was using?
Update:
"""My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance)."""
Yes, Thursday -> Tursday by omitting an "h", but Tuesday -> Tursday by substituting "r" instead of "e". E and R are next to each other on qwERty and azERty keyboards. Every "real person" can easily guess that Thursday is more likely than Tuesday. Even if statistics as well as guesses point to Thursday being more likely than Tuesday (perhaps omitting h will cost 0.5 and e->r will cost 0.75), will the difference (perhaps 0.25) be significant enough to always pick Thursday? Can/will your system ask "Did you mean Tuesday?" or does/will it just plough ahead with Thursday?

Related

The difference in application between SequenceMatcher in edit distance and that in difflib?

I know how the edit distance algorithm is implemented. Using dynamic programming, we first fill the first column and first row, then fill each remaining cell by comparing the three paths from the left, from above, and from the upper-left. For the Ratcliff/Obershelp algorithm, we first extract the longest common substring from the two strings, then recurse on the pair of sub-strings to its left and the pair to its right until no characters are left.
Both of them can be utilized to calculate the similarity between two strings and transform one string into another using four operations: delete, replace, copy, and insert.
But I wonder when to use which: the SequenceMatcher in edit distance or the one in difflib?
Here is what I found on the internet that makes me think that this question would also benefit others:
The documentation of edit distance reads:
Similar to the difflib SequenceMatcher, but uses Levenshtein/edit distance.
In this answer to a question on calculating edit distance, the Ratcliff/Obershelp algorithm was suggested.
There are only a few resources about the Ratcliff/Obershelp algorithm, let alone its comparison to edit distance, which I thought was the most well-known string alignment algorithm.
So far as I know, I have the following ideas:
I find that edit distance and the Ratcliff/Obershelp algorithm can both be used for spell checking. But when to use which?
I thought edit distance is employed to find the minimal edit sequence, while the Ratcliff/Obershelp algorithm yields matches that "look right" to people. However, 'look right' seems too vague a term, especially in real-world applications. What's more, when is the minimum edit sequence required or preferred?
Any suggestions would be highly appreciated, and thanks in advance.
"Looks right to people" needn't be all that vague. Search the web for discussion of why, e.g., the very widely used git source control system added "patience" and "histogram" differencing algorithms, as options. Variations of "minimal edit distance" routinely produce diffs that are jarring to humans, and I'm not going to reproduce examples here that are easily found by searching.
From a formal perspective, Levenshtein is more in line with what a mathematician means by "distance". Chiefly, difflib's .ratio() can depend on the order of the arguments passed to it, but Levenshtein is insensitive to order:
>>> import difflib
>>> difflib.SequenceMatcher(None, "tide", "diet").ratio()
0.25
>>> difflib.SequenceMatcher(None, "diet", "tide").ratio()
0.5
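For contrast, a plain Levenshtein implementation gives the same distance regardless of argument order (a minimal sketch, not a library call):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance with unit costs.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("tide", "diet"), levenshtein("diet", "tide"))  # 3 3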
For the rest, I don't think you're going to get crisp answers. There are many notions of "similarity", not just the two you mentioned, and they all have their fans. "Minimal" was probably thought to be more important back when disk space and bandwidth were scarce and expensive.
The physical realities constraining genetic mutation have made measures that take into account spatial locality much more important in that field - doesn't matter one whit if it's "minimal" if it's also physically implausible ;-) Terms to search for: Smith–Waterman, and Needleman–Wunsch.

What is Grover's algorithm in a big simplification?

I'm new to quantum computing. I mean extremely new. I saw that there is some kind of algorithm called Grover's search algorithm. I have read that it searches through a database containing N elements in order to find a specific element. I also read that standard computers would take many, many years to do this while quantum computers would do it in just a few seconds. And that is what confuses me the most. Here is how I understand it:
Let's say we want to search a database containing 50,000 different names and we are looking for the name "Jack". A standard computer wouldn't take years to do that, right? I think it's a matter of seconds or minutes, as searching through a database of names, which is probably just text, won't take long...
Example in python:
names = ["Mark", "Bob", "Katty", "Susan", "Jack"]
for i in range(len(names)):
    if names[i] == "Jack":
        print("It's Jack!")
    else:
        print("It's not Jack :(")
That's how I understand it. So let's imagine this list contains 50,000 names and we want to search for "Jack". I guess it wouldn't take long.
So how does this Grover's algorithm work? I really can't figure it out.
Grover's search is indeed not a good replacement for classical database lookup methods. (Note that classical databases will have classical indices in them that speed up the lookup far beyond your implementation.) You can see this paper for a discussion of practical applications of Grover search.
It is more correct to think about the oracle as a tool to recognize the answer, not to find it. For example, if you're looking to solve a SAT problem, the oracle circuit will encode the Boolean formula for a specific instance of a problem you're trying to solve.
If you were to use Grover's algorithm for database search, the oracle would have to encode the condition you're searching for, but also the criteria of whether the element is in a database. For example, if you're looking for a name starting with A, the oracle needs to recognize all strings starting with A, but it also needs to recognize which of the strings are present in the database - otherwise the algorithm will yield a random string starting with A, which is probably not what you were looking for.
Grover's algorithm has practical application when generalized to amplitude amplification, which shows up as a component of many other quantum algorithms. Amplitude amplification is a way of improving the success likelihood of a probabilistic quantum algorithm.
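For intuition about the amplification step, here is a tiny classical NumPy simulation of Grover iterations on a toy search space of 8 items (the marked index is arbitrary; this simulates the math, it is not a quantum program):

import numpy as np

N = 8          # toy search space of 8 items
marked = 5     # index of the item the oracle recognizes

# Start in the uniform superposition over all N basis states.
state = np.ones(N) / np.sqrt(N)

# Roughly pi/4 * sqrt(N) iterations of oracle + diffusion.
iterations = int(round(np.pi / 4 * np.sqrt(N)))
for _ in range(iterations):
    state[marked] *= -1                  # oracle: flip the sign of the marked amplitude
    state = 2 * state.mean() - state     # diffusion: reflect amplitudes about their mean

print("Probability of measuring the marked item:", state[marked] ** 2)
# ~0.95 after 2 iterations, versus 1/8 = 0.125 for a single classical random guess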

NLP with Python - how to build a corpus, which classifier to use?

I’m trying to figure out which direction to take my Python NLP project in, and I’d be very grateful to the SO community for any advice.
Problem:
Let’s say I have 100 .txt files that contain the minutes of 100 meetings held by a decision-making body. I also have 100 .txt files of corresponding meeting outcomes, which contain the resolutions passed by this body. The outcomes fall into one of seven categories – 1 – take no action, 2 – take soft action, 3 – take stronger action, 4 – take strongest action, 5 – cancel soft action previously taken, 6 – cancel stronger action previously taken, 7 – cancel strongest action previously taken. Alternatively, this can be presented on a scale from -3 to +3, with 0 signifying no action, +1 signifying soft action, -1 signifying cancellation of soft action previously taken, and so on.
Based on the text of the inputs, I’m interested in predicting which of these seven outcomes will occur.
I’m thinking of treating this as a form of sentiment analysis, since the decision to take a certain kind of action is basically a sentiment. However, all the sentiment analysis examples I’ve found have focused on positive/negative dichotomies, sometimes adding in neutral sentiment as a category. I haven’t found any examples with more than 3 possible classifications of outcomes – not sure whether this is because I haven’t looked in the right places, because it just isn’t really an approach of interest for whatever reason, or because this approach is a silly idea for some reason of which I’m not yet quite sure.
Question 1. Should I be approaching this as a form of sentiment analysis, or is there some other approach that would work better? Should I instead treat this as a kind of categorization matter, similar to classifying news articles by topic and training the model to recognize the "topic" (outcome)?
Corpus:
I understand that I will need to build a corpus for training/test data, and it looks like I have two immediately evident options:
1 – hand-code a CSV file for training data that would contain some key phrases from each input text and list the value of the corresponding outcome on a 7-point scale, similar to what’s been done here: http://help.sentiment140.com/for-students
2 – use the approach Pang and Lee used (http://www.cs.cornell.edu/people/pabo/movie-review-data/) and put each of my .txt files of inputs into one of seven folders based on outcomes, since the outcomes (what kind of action was taken) are known based on historical data.
The downside to the first option is that it would be very subjective – I would determine which keywords/phrases I think are the most important to include, and I may not necessarily be the best arbiter. The downside to the second option is that it might have less predictive power because the texts are pretty long, contain lots of extraneous words/phrases, and are often stylistically similar (policy speeches tend to use policy words). I looked at Pang and Lee’s data, though, and it seems like that may not be a huge problem, since the reviews they’re using are also not very varied in terms of style. I’m leaning towards the Pang and Lee approach, but I’m not sure if it would even work with more than two types of outcomes.
Question 2. Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?
Question 3. Given all of the above, which classifier should I be using? I’m thinking maximum entropy would work best; I’ve also looked into random forests, but I have no experience with the latter and really have no idea what I’m doing (yet) when it comes to them.
Thank you very much in advance :)
Question 1 - The most straightforward way to think of this is as a text classification task (sentiment analysis is one kind of text classification task, but by no means the only one).
Alternatively, as you point out, you could consider your data as existing on a continuum ranging from -3 (cancel strongest action previously taken) to +3 (take strongest action), with 0 (take no action) in the middle. In this case you could treat the outcome as a continuous variable with a natural ordering. If so, then you could treat this as a regression problem rather than a classification problem. It's hard to know whether this is a sensible thing to do without knowing more about the data. If you suspect you will have a number of words/phrases that will be very probable at one end of the scale (-3) and very improbable at the other (+3), or vice versa, then regression may make sense. On the other hand, if the relevant words/phrases are associated with strong emotion and are likely to appear at either end of the scale but not in the middle, then you may be better off treating it as classification. It also depends on how you want to evaluate your results. If your algorithm predicts that a document is a -2 and it's actually a -3, will it be penalized less than if it had predicted +3? If so, it might be better to treat this as a regression task.
Question 2. "Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?"
Note that the set of documents (the .txt files of meeting minutes and corresponding outcomes) is your corpus -- the typical thing to do is randomly select 20% or so to be set aside as test data and use the remaining 80% as training data. The two general options you consider above are options for selecting the set of features that your classification or regression algorithm should attend to.
You correctly identify the upsides and downsides of the two most obvious approaches for coming up with features (hand-picking your own vs. Pang & Lee's approach of just using unigrams (words) as phrases).
Personally I'd also lean towards this latter approach, given that it's notoriously hard for humans to predict which phrases will be useful for classification--although there's no reason why you couldn't combine the two, having your initial set of features include all words plus whatever phrases you think might be particularly relevant. As you point out, there will be a lot of extraneous words, so it may help to throw out words that are very infrequent, or that don't differ enough in frequency between classes to provide any discriminative power. Approaches for reducing an initial set of features are known as "feature selection" techniques - one common method is mentioned here. Or see this paper for a more comprehensive list.
You could also consider features like the percent of high-valence words, high-arousal words, or high-dominance words, using the dataset here (click Supplementary Material and download the zip).
Depending on how much effort you want to put into this project, another common thing to do is to try a whole bunch of approaches and see which works best. Of course, you can't test which approach works best using data in the test set--that would be cheating and would run the risk of overfitting to the test data. But you can set aside a small part of your training set as 'validation data' (i.e. a mini-test set that you use for testing different approaches). Given that you don't have that much training data (80 documents or so), you could consider using cross validation.
Question 3 - The best way is probably to try different approaches and pick whatever works best in cross-validation. But if I had to pick one or two, I personally have found that k-nearest neighbor classification (with low k) or SVMs often work well for this kind of thing. A reasonable approach might be
having your initial features be all unigrams (words) + phrases that you think might be predictive after you look at some training data;
applying a feature selection technique to trim down your feature set;
applying any algorithm that can deal with high-dimensional/text features, such as those in http://www.csc.kth.se/utbildning/kth/kurser/DD2475/ir10/forelasningar/Lecture9_4.pdf (lots of good tips in that pdf!), or those that achieved decent performance in the Pang & Lee paper.
Other possibilities are discussed in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf . Often the specific algorithm matters less than the features that go into it. Frankly it sounds like a very difficult sort of classification task, so it's possible that nothing will work very well.
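To make the unigrams → feature selection → classifier recipe above concrete, here is a minimal scikit-learn sketch; the toy documents and labels stand in for your 100 minutes/outcome pairs, and the specific parameter values are only illustrative:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

# Placeholder corpus: in practice, load each meeting's minutes from its .txt
# file and pair it with the known outcome label (here on the -3..+3 scale).
documents = [
    "the committee resolved to take the strongest possible action",
    "no action was deemed necessary at this time",
    "members voted to rescind the soft measure previously adopted",
    "a modest preliminary step was approved",
]
labels = [3, 0, -1, 1]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams as features
    ("select", SelectKBest(chi2, k=20)),             # simple feature selection step
    ("svm", LinearSVC()),                            # handles high-dimensional text features well
])

clf.fit(documents, labels)
print(clf.predict(["the body agreed to cancel the soft measure"]))

In practice, given the small corpus, you would compare this pipeline against alternatives (e.g. multinomial Naive Bayes or k-NN) with cross_val_score rather than trusting a single fit.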
If you decide to treat it as a regression rather than a classification task, you could go with k nearest neighbors regression ( http://www.saedsayad.com/k_nearest_neighbors_reg.htm ) or ridge regression.
Random forests often do not work well with large numbers of dependent features (words), though they may work well if you end up deciding to go with a smaller number of features (for example, a set of words/phrases you manually select, plus % of high-valence words and % of high-arousal words).

Find unknown exponent in Python with very large numbers

I am attempting to find d in Python, such that
(2 ** d) mod n = s2
Where n =
132177882185373774813945506243321607011510930684897434818595314234725602493934515403833460241072842788085178405842019124354553719616350676051289956113618487539608319422698056216887276531560386229271076862408823338669795520077783060068491144890490733649000321192437210365603856143989888494731654785043992278251
and s2 =
18269259493999292542402899855086766469838750310113238685472900147571691729574239379292239589580462883199555239659513821547589498977376834615709314449943085101697266417531578751311966354219681199183298006299399765358783274424349074040973733214578342738572625956971005052398172213596798751992841512724116639637
I am not looking for the solution, but for a reasonably fast way to do this. I've tried using pow and plugging in different values, but this is slow and never gets the solution. How can I find d?
There is no known efficient algorithm that can solve your problem. It's called the discrete logarithm problem, and some cryptosystems depend on its hardness (you can't find the solution quickly unless you know the factorization of n).
Look at the second answer to Is it possible to get RSA private key knowing public key and set of "original data=>encrypted data" entries?. A known-plaintext attack is no easier than known-ciphertext.
The only known discrete logarithm solvers are built around knowing the factors. If you don't have the factors, you need to generate them.
The best reasonable-time algorithm for this is Shor's algorithm. The problem is that you need a quantum computer with enough qubits, and nobody's built one large enough for your sample data yet. And it looks like it'll be quite a few years before anyone does; currently people are still excited about factoring numbers like 15 and 21.
If you want to use classical computing, the best known algorithms are nowhere near "reasonably fast". I believe someone recently showed that the Bonn results on 2^1039-1 should be reproducible in under 4 months with modern PCs. Another 5 years, and maybe it'll be down to a month.
It shouldn't surprise you that there are no known reasonable fast algorithms, because if there were, most private key encryption would be crackable and therefore worthless. It would be major news if someone gave you the answer you're looking for. (Is there an SO question for "Is P=NP?")
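For completeness, a sketch of a generic classical solver (baby-step giant-step); it shows why "reasonably fast" is out of reach here, since both its time and memory grow like the square root of n, which is astronomically large for a 1024-bit modulus:

import math

def discrete_log_bsgs(g, h, n):
    # Baby-step giant-step: find d with pow(g, d, n) == h.
    # Needs gcd(g, n) == 1 and Python 3.8+ for pow(g, -m, n).
    # Feasible only for small n: time and memory ~ sqrt(n).
    m = math.isqrt(n) + 1
    # Baby steps: g**j mod n for j in [0, m)
    table = {pow(g, j, n): j for j in range(m)}
    # Giant steps: h * (g**-m)**i mod n
    factor = pow(g, -m, n)
    gamma = h
    for i in range(m):
        if gamma in table:
            return i * m + table[gamma]
        gamma = gamma * factor % n
    return None

print(discrete_log_bsgs(2, pow(2, 12345, 1000003), 1000003))  # -> 12345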

Comparing Root-finding (of a function) algorithms in Python

I would like to compare different methods of finding roots of functions in Python (like Newton's method or other simple calculus-based methods). I don't think I will have too much trouble writing the algorithms.
What would be a good way to make the actual comparison? I read up a little bit about Big-O. Would this be the way to go?
The answer from #sarnold is right -- it doesn't make sense to do a Big-Oh analysis.
The principal differences between root finding algorithms are:
rate of convergence (number of iterations)
computational effort per iteration
what is required as input (e.g. do you need to know the first derivative, do you need to set lo/hi limits for bisection, etc.)
what functions it works well on (e.g. works fine on polynomials but fails on functions with poles)
what assumptions it makes about the function (e.g. a continuous first derivative or being analytic, etc.)
how simple the method is to implement
I think you will find that each of the methods has some good qualities, some bad qualities, and a set of situations where it is the most appropriate choice.
Big O notation is ideal for expressing the asymptotic behavior of algorithms as the inputs to the algorithms "increase". This is probably not a great measure for root finding algorithms.
Instead, I would think the number of iterations required to bring the actual error below some epsilon ε would be a better measure. Another measure would be the number of iterations required to bring the difference between successive iterations below some epsilon ε. (The difference between successive iterations is probably a better choice if you don't have exact root values at hand for your inputs. You would use a criteria such as successive differences to know when to terminate your root finders in practice, so you could or should use them here, too.)
While you can characterize the number of iterations required for different algorithms by the ratios between them (one algorithm may take roughly ten times more iterations to reach the same precision as another), there often isn't "growth" in the iterations as inputs change.
Of course, if your algorithms take more iterations with "larger" inputs, then Big O notation makes sense.
Big-O notation is designed to describe how an algorithm behaves in the limit, as n goes to infinity. This is a much easier thing to work with in a theoretical study than in a practical experiment. I would pick things to study that you can easily measure and that people care about, such as accuracy and computer resources (time/memory) consumed.
When you write and run a computer program to compare two algorithms, you are performing a scientific experiment, just like somebody who measures the speed of light, or somebody who compares the death rates of smokers and non-smokers, and many of the same factors apply.
Try to choose an example problem or problems to solve that are representative, or at least interesting to you, because your results may not generalise to situations you have not actually tested. You may be able to increase the range of situations to which your results apply if you sample at random from a large set of possible problems and find that all your random samples behave in much the same way, or at least follow much the same trend. You can have unexpected results even when the theoretical studies show that there should be a nice n log n trend, because theoretical studies rarely account for suddenly running out of cache, or out of memory, or usually even for things like integer overflow.
Be alert for sources of error, and try to minimise them, or have them apply to the same extent to all the things you are comparing. Of course you want to use exactly the same input data for all of the algorithms you are testing. Make multiple runs of each algorithm, and check to see how variable things are - perhaps a few runs are slower because the computer was doing something else at the time. Be aware that caching may make later runs of an algorithm faster, especially if you run them immediately after each other. Which time you want depends on what you decide you are measuring. If you have a lot of I/O to do, remember that modern operating systems and computers cache huge amounts of disk I/O in memory. I once ended up powering the computer off and on again after every run, as the only way I could find to be sure that the device I/O cache was flushed.
You can get wildly different answers for the same problem just by changing starting points. Pick an initial guess that's close to the root and Newton's method will give you a result that converges quadratically. Choose another in a different part of the problem space and the root finder will diverge wildly.
What does this say about the algorithm? Good or bad?
I would suggest you have a look at the following Python root-finding demo.
It is simple code, with several different methods and comparisons between them (in terms of rate of convergence).
http://www.math-cs.gordon.edu/courses/mat342/python/findroot.py
I just finished a project comparing the bisection, Newton, and secant root-finding methods. Since this is a practical case, I don't think you need to use Big-O notation; Big-O notation is more suitable for an asymptotic view. What you can do is compare them in terms of:
Speed - for example, Newton's method is the fastest when good conditions are met
Number of iterations - for example, bisection takes the most iterations
Accuracy - how often it converges to the right root if there is more than one root, or whether it converges at all
Input - what information it needs to get started; for example, Newton's method needs an x0 near the root in order to converge, and it also needs the first derivative, which is not always easy to find
Other - rounding errors
For the sake of visualization you can store the value of each iteration in arrays and plot them (see the sketch below). Use a function whose roots you already know.
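Here is a rough sketch of that kind of comparison harness: each method records its iterates on f(x) = x**2 - 2, whose root sqrt(2) is known, so you can count iterations and check the final error (plotting the stored arrays is left out to keep it short):

import math

def bisection(f, lo, hi, tol=1e-12):
    iterates = []
    while hi - lo > tol:
        mid = (lo + hi) / 2
        iterates.append(mid)
        if f(lo) * f(mid) <= 0:   # root lies in [lo, mid]
            hi = mid
        else:
            lo = mid
    return iterates

def newton(f, df, x0, tol=1e-12, max_iter=100):
    iterates = [x0]
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / df(x)
        iterates.append(x_new)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return iterates

def secant(f, x0, x1, tol=1e-12, max_iter=100):
    iterates = [x0, x1]
    for _ in range(max_iter):
        x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        iterates.append(x2)
        if abs(x2 - x1) < tol:
            break
        x0, x1 = x1, x2
    return iterates

# f(x) = x**2 - 2 has a known root at sqrt(2).
f = lambda x: x * x - 2
df = lambda x: 2 * x
root = math.sqrt(2)

for name, its in [("bisection", bisection(f, 1, 2)),
                  ("newton", newton(f, df, 1.5)),
                  ("secant", secant(f, 1, 2))]:
    print(f"{name}: {len(its)} iterations, final error {abs(its[-1] - root):.2e}")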
Although this is a very old post, my 2 cents :)
Once you've decided which algorithmic method to use to compare them (your "evaluation protocol", so to speak), then you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)
