I'm new to quantum computing. I mean extremely new. I saw that there is some kind of algorithm called Grover's search algorithm. I have read that it searches through a database containing N elements in order to find a specific element. I also read that classical computers would take many, many years to do this, while quantum computers would do it in just a few seconds. And that is what confuses me the most. Here is how I understand it:
Let's say we want to search a database containing 50,000 different names and we are looking for the name "Jack". A classical computer wouldn't take years for that, right? I think it's a matter of seconds or minutes, as searching through a database of names, which is just text, won't take long...
Example in python:
names = ["Mark", "Bob", "Katty", "Susan", "Jack"]
for name in names:
    if name == "Jack":
        print("It's Jack!")
    else:
        print("It's not Jack :(")
That's how I understand it. So let's imagine this list contains 50,000 names and we want to search for "Jack". I guess it wouldn't take long.
So how does this Grover's algorithm work? I really can't figure it out.
Grover's search is indeed not a good replacement for classical database lookup methods. (Note that classical databases have indices in them that will speed up the lookup way beyond your linear-scan implementation.) You can see this paper for a discussion of practical applications of Grover search.
It is more correct to think about the oracle as a tool to recognize the answer, not to find it. For example, if you're looking to solve a SAT problem, the oracle circuit will encode the Boolean formula for a specific instance of a problem you're trying to solve.
If you were to use Grover's algorithm for database search, the oracle would have to encode not only the condition you're searching for, but also whether the element is actually present in the database. For example, if you're looking for a name starting with A, the oracle needs to recognize all strings starting with A, but it also needs to recognize which of those strings are present in the database - otherwise the algorithm will yield a random string starting with A, which is probably not what you were looking for.
Grover's algorithm has practical application when generalized to amplitude amplification, which shows up as a component of many other quantum algorithms. Amplitude amplification is a way of improving the success likelihood of a probabilistic quantum algorithm.
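To make the amplitude-amplification idea concrete, here is a small classical simulation of Grover's amplitude dynamics (a sketch with numpy, not a real quantum program; the item count and marked index are made up): the oracle merely flips the sign of the marked item's amplitude, and the diffusion step reflects all amplitudes about their mean, so after roughly (π/4)√N iterations the marked item dominates the measurement probabilities.

```python
import numpy as np

def grover_probabilities(n_items, marked, iterations):
    # Start in the uniform superposition over all n_items basis states.
    amps = np.full(n_items, 1 / np.sqrt(n_items))
    for _ in range(iterations):
        amps[marked] *= -1             # oracle: flip the marked amplitude's sign
        amps = 2 * amps.mean() - amps  # diffusion: reflect about the mean
    return amps ** 2                   # measurement probabilities

N = 8
optimal = round(np.pi / 4 * np.sqrt(N))  # ~O(sqrt(N)) iterations
probs = grover_probabilities(N, marked=5, iterations=optimal)
```

After the two optimal iterations for N = 8, the marked state carries well over 90% of the probability, which is the whole point: the oracle never "finds" anything, it only recognizes the answer, and the iteration amplifies it.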
I have a dataframe (more than 1 million rows) that has an open text column where customers can write whatever they want.
Misspelled words appear frequently and I'm trying to group comments that are essentially the same.
For example:
ID | Comment
---|--------
1  | I want to change my credit card
2  | I wannt change my creditt card
3  | I want change credit caurd
I have tried using Levenshtein Distance but computationally it is very expensive.
Can you tell me another way to do this task?
Thanks!
Levenshtein distance has time complexity O(N^2) in the length of the strings.
If you define a maximum distance you're interested in, say m, you can reduce the time complexity to O(N*m). The maximum distance, in your context, is the maximum number of typos you accept while still considering two comments identical.
If you cannot do that, you may try to parallelize the task.
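As a sketch of the O(N*m) idea (the function name and the early-exit strategy are illustrative, not a specific library API): the DP can bail out as soon as every entry in the current row already exceeds the maximum distance m.

```python
def bounded_levenshtein(a, b, max_dist):
    """Levenshtein distance, capped at max_dist + 1.

    Stops early once no alignment can stay within max_dist.
    """
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        if min(cur) > max_dist:  # every alignment already too expensive
            return max_dist + 1
        prev = cur
    return min(prev[-1], max_dist + 1)
```

The early exit is what buys the speed-up in practice: most comment pairs are nothing alike and get rejected after a few rows.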
This is not a trivial task. If faced with this problem, my approach would be:
Tokenise your sentences. There are many ways to tokenise a sentence, the most straightforward way is to convert a sentence to a list of words. E.g. I want to change my credit card becomes [I, want, to, change, my, credit, card]. Another way is to roll a window of size n across your sentence, e.g. I want to becomes ['I w', ' wa', 'wan', 'ant', ...] for window size 3.
After tokenising your sentence, create an embedding (vectorising), i.e. convert your tokens to a vector of numbers. The simplest way is to use a ready-made library like sklearn's TfidfVectorizer. If your data cares about the order of the words, then a more sophisticated vectoriser is needed.
After vectorising, use a clustering algorithm. The most simple one is K-Means.
Of course, this is a very complicated task, and there are many ways to approach it. What I described is the simplest out-of-the-box solution. Some clever people have used different strategies to get better results. One example is https://www.youtube.com/watch?v=nlKE4gvJjMo. You'll need to research this field on your own.
Edit: of course your approach is fine for a small dataset. The difficult part lies in performing better than O(n^2) complexity.
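The three steps above (tokenise, vectorise, cluster) can be sketched with scikit-learn; the sample comments and cluster count below are invented for illustration. Character n-grams are a good tokenisation choice here because typos like "creditt" still share most of their 3-grams with "credit".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "I want to change my credit card",
    "I wannt change my creditt card",
    "I want change credit caurd",
    "please close my account",
    "close my account now",
]
# Step 1 + 2: tokenise into character 3-grams and TF-IDF vectorise.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(comments)
# Step 3: cluster the sparse vectors with k-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

On this toy input, the three typo variants of the credit-card request end up in one cluster and the account comments in the other.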
I was given a problem in which you are supposed to write a python code that distributes a number of different weights among 4 boxes.
Logically we can't expect a perfect distribution: given weights like 10, 65, 30, 40, 50 and 60 kilograms, there is no way of grouping those numbers without making one box heavier than another. But we can aim for the most homogeneous distribution, e.g. ((60), (40, 30), (65), (50, 10)).
I can't even think of an algorithm to complete this task let alone turn it into python code. Any ideas about the subject would be appreciated.
The problem you're describing is similar to the "fair teams" problem, so I'd suggest looking there first.
Because a simple greedy algorithm where weights are added to the lightest box won't work, the most straightforward solution would be a brute force recursive backtracking algorithm that keeps track of the best solution it has found while iterating over all possible combinations.
As stated in j_random_hacker's response, this is not going to be something easily done. My best idea right now is to find some baseline. I define the baseline as the object with the largest value, since it cannot be subdivided. Using that, you can start trying to match the rest of the data to that value, which would only take about three passes: the first and second create a list of every possible combination, and the third goes over that list and compares the options, taking the average of each group and keeping the one whose value is closest to your baseline.
Using your example, 65 is the baseline, and since you cannot subdivide it you know it has to be the minimum bound on your data grouping, so you would try to match all of the rest of the values to it. It won't be great, but it does give you something to start with.
As j_random_hacker notes, the partition problem is NP-complete. This problem is also NP-complete by a reduction from the 4-partition problem (the article also contains a link to a paper by Garey and Johnson that proves that 4-partition itself is NP-complete).
In particular, given a list to 4-partition, you could feed that list as an input to a function that solves your box distribution problem. If each box had the same weight in it, a 4-partition would exist, otherwise not.
Your best bet would be an exponential-time algorithm that uses backtracking to iterate over the 4^n possible assignments, because unless P = NP (highly unlikely), no polynomial-time algorithm exists for this problem.
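A minimal brute-force version of that idea (function name and the "spread" objective, heaviest box minus lightest box, are my choices for illustration; a real backtracking solver would prune partial assignments instead of enumerating all of them):

```python
from itertools import product

def distribute(weights, n_boxes=4):
    """Enumerate all n_boxes**len(weights) assignments and keep the one
    with the smallest spread between the heaviest and lightest box."""
    best, best_spread = None, float("inf")
    for assignment in product(range(n_boxes), repeat=len(weights)):
        totals = [0] * n_boxes
        for w, box in zip(weights, assignment):
            totals[box] += w
        spread = max(totals) - min(totals)
        if spread < best_spread:
            best, best_spread = assignment, spread
    boxes = [[] for _ in range(n_boxes)]
    for w, box in zip(weights, best):
        boxes[box].append(w)
    return boxes, best_spread

boxes, spread = distribute([10, 65, 30, 40, 50, 60])
```

For the question's example weights this finds a grouping with spread 10, e.g. (65), (60), (50, 10), (40, 30), matching the hand-made answer. With 4^n assignments this is only feasible for small n, which is exactly the NP-completeness point above.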
I have a database table containing well over a million strings. Each string is a term that can vary in length from two words to five or six.
["big giant cars", "zebra videos", "hotels in rio de janeiro".......]
I also have a blacklist of over several thousand smaller terms in a csv file. What I want to do is identify similar terms in the database to the blacklisted terms in my csv file. Similarity in this case can be construed as mis-spellings of the blacklisted terms.
I am familiar with libraries in Python such as fuzzywuzzy that can assess string similarity using Levenshtein distance and return an integer representation of the similarity. An example from this tutorial would be:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
A downside with this approach would be that it may falsely identify terms that may mean something in a different context.
A simple example of this would be "big butt", a blacklisted string, being confused with a more innocent string like "big but".
My question is, is it programmatically possible in python to accomplish this or would it be easier to just retrieve all the similar looking keywords and filter for false positives?
I'm not sure there's any definitive answer to this problem, so the best I can do is explain how I'd approach it, and hopefully you'll be able to get some ideas from my ramblings. :-)
First.
On an unrelated angle, fuzzy string matching might not be enough. People are going to be using similar-looking characters and non-character symbols to get around any text matches, to the point where there's nearly zero match between a blacklisted word and actual text, and yet it's still readable for what it is. So perhaps you will need some normalization of your dictionary and search text, like converting all '0' (zeroes) to 'O' (capital O), '><' to 'X' etc. I believe there are libraries and/or conversion references to that purpose. Non-latin symbols are also a distinct possibility and should be accounted for.
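A minimal sketch of that kind of normalisation using only the standard library (the substitution table below is illustrative only; a real deployment would need a much larger map, including Unicode look-alike characters):

```python
# Hypothetical look-alike table: digits/symbols people use to dodge filters.
LEET = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s",
                      "@": "a", "$": "s"})

def normalize(text):
    # Lowercase, map look-alike characters, and collapse multi-char
    # substitutions like '><' standing in for 'x'.
    return text.lower().translate(LEET).replace("><", "x")
```

Running both the dictionary and the input text through the same normaliser before fuzzy matching keeps the matcher from being fooled by the simplest character swaps.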
Second.
I don't think you'll be able to differentiate between blacklisted words and similar-looking legal variants in a single pass. So yes, most likely you will have to search for possible blacklisted matches and then check if what you found matches some legal words too. Which means you will need not only the blacklisted dictionary, but a whitelisted dictionary as well. On a more positive note, there's probably no need to normalize the whitelisted dictionary, as people who're writing acceptable text are probably going to write it in acceptable language without any tricks outlined above. Or you could normalize it if you're feeling paranoid. :-)
Third.
However, the problem is that matching words/expressions against black and white lists doesn't actually give you a reliable answer. Using your example, a person might write "big butt" as an honest typo which will be obvious in context (or vice versa, write "big but" intentionally to get a higher match against a whitelisted word, even if context makes it quite obvious what the real meaning is). So you might have to actually check the context in case there are good enough matches against both black and white lists. This is an area I'm not intimately familiar with. It's probably possible to build correlation maps for various words (from both dictionaries) to identify which words are more (or less) frequently used in combination with them, and use them to check your specific example. Using this very paragraph as an example, the word "black" could be whitelisted if it's used together with "list" but blacklisted in some other situations.
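A toy version of the two-list check, using difflib from the standard library (the lists, cutoff, and result labels are invented for illustration): phrases that match both lists fall into the gray area discussed below and would go to a human.

```python
import difflib

# Hypothetical dictionaries -- real ones would be far larger.
blacklist = ["big butt"]
whitelist = ["big but"]

def classify(phrase, cutoff=0.8):
    # get_close_matches returns list members whose similarity ratio
    # to `phrase` is at least `cutoff`.
    black = difflib.get_close_matches(phrase, blacklist, n=1, cutoff=cutoff)
    white = difflib.get_close_matches(phrase, whitelist, n=1, cutoff=cutoff)
    if black and white:
        return "gray"     # ambiguous: needs context or human review
    if black:
        return "blocked"
    return "ok"
```

Note how sensitive the outcome is to the cutoff: at 0.8 the exact blacklisted phrase still matches the whitelist too and lands in the gray area, while a stricter cutoff separates them.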
Fourth.
Even applying all those measures together you might want to leave a certain amount of gray area. That is, unless there's a high enough certainty in either direction, leave the final decision for a human (screening comments/posts for a time, automatically putting them into moderation queue, or whatever else your project dictates).
Fifth.
You might try to dabble in learning algorithms, collecting human input from previous step and using it to automatically fine-tune your algorithm as time goes by.
Hope that helps. :-)
I am a newbie in Python and have been trying my hand at different problems that introduce me to different modules and functionalities (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups that use Facebook as a medium to broadcast information.
I want to group these posts by content, putting content-wise similar posts together.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar as they both ask to go the group's website and register.
P.S.: Just a clarification: had either of the links been abc.com instead, the posts wouldn't have been similar.
Priority is to the source and then to the action (action being registering here).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is, can Python make this job easier for me somehow? Some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of urls contained within, I would do it like this:
from collections import defaultdict

groups = defaultdict(list)
for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
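The geturls helper is assumed by the answer; a rough regex-based version might look like the following (the pattern is a simplification that only catches bare domains like xyz.com, and the sample posts are invented). Putting it together end-to-end:

```python
import re
from collections import defaultdict

# Simplified pattern: optional scheme/www, then something like "xyz.com".
URL_RE = re.compile(r"(?:https?://)?(?:www\.)?([\w-]+\.\w{2,})")

def geturls(text):
    return URL_RE.findall(text)

facebook_posts = [
    "xyz.com is selling free domains. Go register at xyz.com",
    "Everyone needs to register again at xyz.com. Due to server failure, "
    "all data has been lost.",
    "abc.com has a sale today",
]
groups = defaultdict(list)
for post in facebook_posts:
    for url in set(geturls(post)):  # set() so each post counts once per domain
        groups[url].append(post)
```

Both xyz.com posts land in the same group, and the abc.com post in its own, which matches the priority the question gives to the source.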
That greatly depends on your definition of being "content-wise same". A straight forward approach is to use a so-called Term Frequency - Inverse Document Frequency (TFIDF) model.
Simply put, make a long list of all words in all your posts, filter out stop-words (articles, determiners etc.), and for each document (= post) count how often each term occurs, multiplying that by the importance of the term (which is the inverse document frequency, calculated as the log of the ratio of documents in which this term occurs). This way, words which are very rare will be more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will be significantly high, so similar documents might be the ones where the same term achieved the highest score (i.e. the highest component of the document vectors is the same), or maybe the euclidean distance between the few highest values is below some parameter. That sounds very complicated, but (of course) there's a module for that.
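A compressed version of this pipeline using scikit-learn (the sample posts are invented; real data would need more preprocessing): the TF-IDF vectors are compared with cosine similarity, which is the usual metric for such sparse term vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "xyz.com is selling free domains. Go register at xyz.com",
    "Everyone needs to register again at xyz.com. Due to server failure, "
    "all data has been lost.",
    "I had pancakes for breakfast",
]
# Build the sparse document-term matrix, dropping English stop-words.
X = TfidfVectorizer(stop_words="english").fit_transform(posts)
# Pairwise cosine similarity between all documents.
sim = cosine_similarity(X)
```

The two posts sharing the xyz.com terms score far higher against each other than against the unrelated third post.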
Where can I find some real world typo statistics?
I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.
There are 2 kinds of mistakes:
typos - "Helllo" instead of "Hello" / "Satudray" instead of "Saturday" etc.
Spelling - "Shikago" instead of "Chicago"
I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).
I want to focus on the Damerau-Levenshtein (or simply edit) distance. The textbook implementations always use 1 for the weight of deletions, insertions, substitutions, and transpositions. While this is simple and allows for nice algorithms, it doesn't match "reality" / "real-world probabilities".
Examples:
I'm sure the likelihood of "Helllo" ("Hello") is greater than "Helzlo", yet they are both 1 edit distance away.
"Gello" is closer than "Qello" to "Hello" on a QWERTY keyboard.
Unicode transliterations: What is the "real" distance between "München" and "Munchen"?
What should the "real world" weights be for deletions, insertions, substitutions, and transpositions?
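As a toy sketch of what such weights could look like (the 0.5/1.0 values and the flat QWERTY grid are made up for illustration; real weights should come from typo corpora like those discussed in the answers): substitutions between physically adjacent keys are made cheaper, so "Gello" ends up closer to "Hello" than "Qello" does.

```python
# Lowercase-letters-only toy model of a QWERTY layout.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {c: (r, i) for r, row in enumerate(QWERTY_ROWS) for i, c in enumerate(row)}

def sub_cost(a, b):
    if a == b:
        return 0.0
    ra, ca = POS.get(a, (9, 9))
    rb, cb = POS.get(b, (9, 9))
    # Adjacent keys cost less than distant ones (invented 0.5 vs 1.0).
    return 0.5 if abs(ra - rb) <= 1 and abs(ca - cb) <= 1 else 1.0

def weighted_edit_distance(s, t):
    # Standard Levenshtein DP with the weighted substitution cost plugged in.
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]
```

Transposition and deletion/insertion weights could be made functions of their arguments in exactly the same way.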
Even Norvig's very cool spell corrector uses non-weighted edit distance.
BTW, I'm sure the weights need to be functions and not simple floats (per the above examples)...
I can adjust the algorithm, but where can I "learn" these weights? I don't have access to Google-scale data...
Should I just guess them?
EDIT - trying to answer user questions:
My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance).
I'm developing an NLP Travel Search engine, so my dictionary contains ~25K destinations (expected to grow to 100K), Time Expressions ~200 (expected 1K), People expressions ~100 (expected 300), Money Expressions ~100 (expected 500), "glue logic words" ("from", "beautiful", "apartment") ~2K (expected 10K) and so on...
Usage of the edit distance is different for each of the above word-groups. I try to "auto-correct when obvious", e.g. 1 edit distance away from only 1 other word in the dictionary. I have many other hand-tuned rules, e.g. Double Metaphone fix which is not more than 2 edit distance away from a dictionary word with a length > 4... The list of rules continues to grow as I learn from real world input.
"How many pairs of dictionary entries are within your threshold?": well, that depends on the "fancy weighting system" and on real world (future) input, doesn't it? Anyway, I have extensive unit tests so that every change I make to the system only makes it better (based on past inputs, of course). Most sub-6 letter words are within 1 edit distance from a word that is 1 edit distance away from another dictionary entry.
Today when there are 2 dictionary entries at the same distance from the input I try to apply various statistics to better guess which the user meant (e.g. Paris, France is more likely to show up in my search than Pārīz, Iran).
The cost of choosing a wrong word is returning semi-random (often ridiculous) results to the end-user and potentially losing a customer. The cost of not understanding is slightly less expensive: the user will be asked to rephrase.
Is the cost of complexity worth it? Yes, I'm sure it is. You would not believe the amount of typos people throw at the system and expect it to understand, and I could sure use the boost in Precision and Recall.
A possible source of real-world typo statistics would be Wikipedia's complete edit history.
http://download.wikimedia.org/
Also, you might be interested in the AWB's RegExTypoFix
http://en.wikipedia.org/wiki/Wikipedia:AWB/T
I would advise you to check out the trigram algorithm. In my opinion it works better for finding typos than an edit distance algorithm. It should work faster as well, and if you keep the dictionary in a Postgres database you can make use of an index.
You may also find the Stack Overflow topic about Google's "Did you mean" feature useful.
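The trigram idea can be sketched in a few lines (the padding scheme loosely imitates what Postgres's trigram matching does; the similarity measure here is plain Jaccard over trigram sets): words with typos still share most of their trigrams.

```python
def trigrams(word):
    # Pad so that word boundaries produce their own trigrams.
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    # Jaccard similarity of the two trigram sets.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

Because trigram sets can be indexed, candidate lookups become set-overlap queries rather than pairwise edit-distance computations.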
Probability Scoring for Spelling Correction by Church and Gale might help. In that paper, the authors model typos as a noisy channel between the author and the computer. The appendix has tables for typos seen in a corpus of Associated Press publications. There is a table for each of the following kinds of typos:
deletion
insertion
substitution
transposition
For example, examining the insertion table, we can see that l was incorrectly inserted after l 128 times (the highest number in that column). Using these tables, you can generate the probabilities you're looking for.
If research is your interest, I think continuing with that algorithm and trying to find decent weights would be fruitful.
I can't help you with typo stats, but I think you should also play with Python's difflib, specifically the ratio() method of SequenceMatcher. It uses an algorithm which the docs (http://docs.python.org/library/difflib.html) claim is well suited to matches that 'look right', and may be useful to augment or test what you're doing.
For python programmers just looking for typos it is a good place to start. One of my coworkers has used both Levenshtein edit distance and SequenceMatcher's ratio() and got much better results from ratio().
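A minimal example of that ratio() call, reusing the "NEW YORK METS" example from earlier in the thread:

```python
import difflib

def similarity(a, b):
    # SequenceMatcher.ratio() returns a float in [0, 1]; higher is closer.
    return difflib.SequenceMatcher(None, a, b).ratio()
```

This gives roughly the same 0.96-ish score as the fuzz.ratio example above, since fuzzywuzzy's simple ratio is built on the same kind of sequence matching.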
Some questions for you, to help you determine whether you should be asking your "where do I find real-world weights" question:
Have you actually measured the effectiveness of the uniform weighting implementation? How?
How many different "internal objects" do you have -- i.e. what is the size of your dictionary?
How are you actually using the edit distance e.g. John/Joan, Marmaduke/Marmeduke, Featherstonehaugh/Featherstonhaugh: is that "all 1 error" or is it 25% / 11.1% / 5.9% difference? What threshold are you using?
How many pairs of dictionary entries are within your threshold (e.g. John vs Joan, Joan vs Juan, etc)? If you introduced a fancy weighting system, how many pairs of dictionary entries would migrate (a) from inside the threshold to outside (b) vice versa?
What do you do if both John and Juan are in your dictionary and the user types Joan?
What are the penalties/costs of (1) choosing the wrong dictionary word (not the one that the user meant) (2) failing to recognise the user's input?
Will introducing a complicated weighting system actually reduce the probabilities of the above two error types by sufficient margin to make the complication and slower speed worthwhile?
BTW, how do you know what keyboard the user was using?
Update:
"""My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance)."""
Yes, Thursday -> Tursday by omitting an "h", but Tuesday -> Tursday by substituting "r" instead of "e". E and R are next to each other on qwERty and azERty keyboards. Every "real person" can easily guess that Thursday is more likely than Tuesday. Even if statistics as well as guesses point to Thursday being more likely than Tuesday (perhaps omitting h will cost 0.5 and e->r will cost 0.75), will the difference (perhaps 0.25) be significant enough to always pick Thursday? Can/will your system ask "Did you mean Tuesday?" or does/will it just plough ahead with Thursday?