Identifying similar strings in a database in Python

I have a database table containing well over a million strings. Each string is a term that can vary in length from two words to five or six.
["big giant cars", "zebra videos", "hotels in rio de janeiro".......]
I also have a blacklist of several thousand smaller terms in a csv file. What I want to do is identify terms in the database that are similar to the blacklisted terms in my csv file. Similarity in this case means misspellings of the blacklisted terms.
I am familiar with Python libraries such as fuzzywuzzy that can assess string similarity using Levenshtein distance and return an integer representation of the similarity. An example from this tutorial would be:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
A downside with this approach would be that it may falsely identify terms that may mean something in a different context.
A simple example of this would be "big butt", a blacklisted string, being confused with a more innocent string like "big but".
My question is, is it programmatically possible in python to accomplish this or would it be easier to just retrieve all the similar looking keywords and filter for false positives?

I'm not sure there's any definitive answer to this problem, so the best I can do is to explain how I'd approach it, and hopefully you'll be able to get some ideas from my ramblings. :-)
First.
On an unrelated angle, fuzzy string matching might not be enough. People are going to be using similar-looking characters and non-character symbols to get around any text matches, to the point where there's nearly zero match between a blacklisted word and actual text, and yet it's still readable for what it is. So perhaps you will need some normalization of your dictionary and search text, like converting all '0' (zeroes) to 'O' (capital O), '><' to 'X' etc. I believe there are libraries and/or conversion references to that purpose. Non-latin symbols are also a distinct possibility and should be accounted for.
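As a rough illustration of the normalization idea, here is a minimal sketch using only the standard library. The look-alike table is a made-up starting point, not a complete reference, and I map to lowercase rather than to capitals purely for convenience.

```python
import unicodedata

# Hypothetical look-alike table; extend it from the evasions you actually see.
LOOKALIKES = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s",
})

def normalize(text: str) -> str:
    """Lowercase, strip accents, and fold common look-alike characters."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Multi-character substitutions have to happen before the 1:1 table.
    lowered = stripped.lower().replace("><", "x")
    return lowered.translate(LOOKALIKES)

print(normalize("B1G BÜTT"))
```

Note that folding digits like this will also mangle legitimate numbers, so you would apply it only to the copy of the text used for matching.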
Second.
I don't think you'll be able to differentiate between blacklisted words and similar-looking legal variants in a single pass. So yes, most likely you will have to search for possible blacklisted matches and then check if what you found matches some legal words too. Which means you will need not only the blacklisted dictionary, but a whitelisted dictionary as well. On a more positive note, there's probably no need to normalize the whitelisted dictionary, as people who're writing acceptable text are probably going to write it in acceptable language without any tricks outlined above. Or you could normalize it if you're feeling paranoid. :-)
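The two-pass idea could look roughly like the sketch below. I'm using `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, and the function names and the 0.85 threshold are my own assumptions that you would need to tune.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def best_match(term, candidates):
    """Return (score, candidate) for the closest candidate."""
    return max(((similarity(term, c), c) for c in candidates),
               default=(0.0, None))

def flag(term, blacklist, whitelist, threshold=0.85):
    """Return the blacklisted term this looks like, or None."""
    black_score, black_term = best_match(term, blacklist)
    if black_score < threshold:
        return None  # pass 1: not close to anything blacklisted
    white_score, _ = best_match(term, whitelist)
    if white_score >= black_score:
        return None  # pass 2: a legal word explains it at least as well
    return black_term

print(flag("big buttt", ["big butt"], ["big but"]))
```

Comparing every term against every blacklist entry is quadratic, so for a million rows you would want to narrow the candidates first (for example by length or by shared n-grams).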
Third.
However, the problem is that matching words/expressions against black and white lists doesn't actually give you a reliable answer. Using your example, a person might write "big butt" as an honest typo which will be obvious in context (or vice versa, write "big but" intentionally to get a higher match against a whitelisted word, even if context makes it quite obvious what the real meaning is). So you might have to actually check the context in case there are good enough matches against both black and white lists. This is an area I'm not intimately familiar with. It's probably possible to build correlation maps for various words (from both dictionaries) to identify what words are more (or less) frequently used in combination with them, and use them to check your specific example. Using this very paragraph as an example, the word "black" could be whitelisted if it's used together with "list" but blacklisted in some other situations.
Fourth.
Even applying all those measures together you might want to leave a certain amount of gray area. That is, unless there's a high enough certainty in either direction, leave the final decision for a human (screening comments/posts for a time, automatically putting them into moderation queue, or whatever else your project dictates).
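A gray area can be as simple as a margin between the two scores. This is only a sketch: the function name, the threshold and the margin are assumptions, and the scores are expected to be on a 0 to 1 scale.

```python
def triage(black_score: float, white_score: float,
           threshold: float = 0.85, margin: float = 0.1) -> str:
    """Decide what to do with a term given its best list scores."""
    if black_score < threshold:
        return "allow"   # nowhere near the blacklist
    if black_score - white_score >= margin:
        return "block"   # clearly closer to a blacklisted term
    if white_score - black_score >= margin:
        return "allow"   # clearly closer to a legal term
    return "review"      # too close to call, send to a human

print(triage(0.93, 0.95))
```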
Fifth.
You might try to dabble in learning algorithms, collecting human input from previous step and using it to automatically fine-tune your algorithm as time goes by.
Hope that helps. :-)

Related

Identify domain related important keywords from a given text

I am relatively new to the field of NLP/text processing. I would like to know how to identify domain-related important keywords from a given text.
For example, if I have to build a Q&A chatbot that will be used in the Banking domain, the Q would be like: What is the maturity date for TRADE:12345 ?
From the Q, I would like to extract the keywords: maturity date & TRADE:12345.
From the extracted information, I would frame a SQL-like query, search the DB, retrieve the SQL output and provide the response back to the user.
Any help would be appreciated.
Thanks in advance.
So, this is where the work comes in.
Normally people start with a stop word list. There are several, choose wisely. But more than likely you'll experiment and/or use a base list and then add more words to that list.
Depending on the list it will take out
"what, is, the, for, ?"
Since this is a pretty easy example, they'll all do that. But you'll notice that what is being done is just the opposite of what you wanted. You asked for domain-specific words, but what is happening is the removal of all that other cruft (as far as the library is concerned).
From here it will depend on what you use. NLTK or Spacy are common choices. Regardless of what you pick, get a real understanding of concepts or it can bite you (like pretty much anything in Data Science).
Expect to start thinking in terms of linguistic patterns so, in your example:
What is the maturity date for TRADE:12345 ?
'What' is an interrogative, 'the' is a definite article, 'for' starts a prepositional phrase.
There may be other clues such as the ':' or that TRADE is in all caps. But, it might not be.
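To make that concrete, here is a small sketch that combines a stop word list with a pattern for the identifier. The stop word set and the `NAME:digits` pattern are assumptions taken from the single example question; a real system would use a proper list from NLTK or spaCy and more robust patterns.

```python
import re

# Tiny illustrative stop word list, not a real one.
STOP_WORDS = {"what", "is", "the", "for", "a", "an", "of"}
# Assumed identifier shape: upper-case name, colon, digits (e.g. TRADE:12345).
IDENTIFIER = re.compile(r"\b[A-Z]+:\d+\b")

def extract_keywords(question: str):
    """Return (remaining key phrase, list of identifiers)."""
    identifiers = IDENTIFIER.findall(question)
    remainder = IDENTIFIER.sub(" ", question)
    words = re.findall(r"[A-Za-z]+", remainder)
    phrase = " ".join(w.lower() for w in words if w.lower() not in STOP_WORDS)
    return phrase, identifiers

print(extract_keywords("What is the maturity date for TRADE:12345 ?"))
```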
That should get you started but you might look at some of the other StackExchange sites for deeper expertise.
Finally, you want to break a question like this into more than one question (assuming that you've done the research and determined the question hasn't already been asked -- repeatedly). So, NLTK and NLP are decently new, but SQL queries are usually a Google search.

Python3 remove multiple hyphenations from a german string

I'm currently working on a neural network that evaluates students' answers to exam questions. Therefore, preprocessing the corpora for a Word2Vec network is needed. Hyphenation in german texts is quite common. There are mainly two different types of hyphenation:
1) End of line:
The text reaches the end of the line so the last word is sepa-
rated.
2) Short form of enumeration:
in case of two "elements":
Geistes- und Sozialwissenschaften
more "elements":
Wirtschafts-, Geistes- und Sozialwissenschaften
The de-hyphenated form of these enumerations should be:
Geisteswissenschaften und Sozialwissenschaften
Wirtschaftswissenschaften, Geisteswissenschaften und Sozialwissenschaften
I need to remove all hyphenations and put the words back together. I already found several solutions for the first problem.
But I have absolutely no clue how to get the second part (in the example above "wissenschaften") of the words in the enumeration problem. I don't even know if it is possible at all.
I hope that I have pointed out my problem properly.
So has anyone an idea how to solve this problem?
Thank you very much in advance!
It's surely possible, as the pattern seems fairly regular. (Something vaguely analogous is sometimes seen in English. For example: The new requirements applied to under-, over-, and average-performing employees.)
The rule seems to be roughly, "when you see word-fragments with a trailing hyphen, and then an und, look for known words that begin with the word-fragments, and end the same as the terminal-word-after-und – and replace the word-fragments with the longer words".
Not being a German speaker and without language-specific knowledge, it wouldn't be possible to know exactly where breaks are appropriate. That is, in your Geistes- und Sozialwissenschaften example, without language-specific knowledge, it's unclear whether the first fragment should become Geisteszialwissenschaften or Geisteswissenschaften or Geistesenschaften or Geistesaften or any other word sharing a suffix with Sozialwissenschaften. But if you've got a dictionary of word-fragments, or word-frequency info from other text that uses the same full-length word(s) without this particular enumeration-hyphenation, that could help choose.
(If there's more than one plausible suffix based on known words, this might even be a possible application of word2vec: the best suffix to choose might well be the one that creates a known-word that is closest to the terminal-word in word-vector-space.)
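A minimal sketch of that rule, assuming you already have a set of known full words (the `vocab` argument here is a three-word placeholder; in practice it would come from your corpus or a dictionary). It tries every suffix of the word after "und"/"oder" and keeps the first completion that is a known word, leaving the fragment untouched otherwise.

```python
import re

# One or more "Fragment-," items, a final "Fragment-", then und/oder + full word.
ENUMERATION = re.compile(r"((?:\w+-,\s+)*\w+-\s+(?:und|oder)\s+)(\w+)")

def expand_enumerations(text: str, vocab: set) -> str:
    def expand(match):
        head, last = match.group(1), match.group(2)

        def complete(fragment_match):
            fragment = fragment_match.group(1)
            # Try suffixes of the final word, longest first.
            for i in range(1, len(last)):
                candidate = fragment + last[i:]
                if candidate in vocab:
                    return candidate
            return fragment_match.group(0)  # unknown: keep "Fragment-"

        return re.sub(r"(\w+)-", complete, head) + last

    return ENUMERATION.sub(expand, text)

vocab = {"Geisteswissenschaften", "Wirtschaftswissenschaften",
         "Sozialwissenschaften"}
print(expand_enumerations("Wirtschafts-, Geistes- und Sozialwissenschaften", vocab))
```

If several suffixes produce known words, this simply takes the longest one, which is where frequency information or word vectors could make a better choice.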
Since this seems a very German-specific issue, I'd try asking in forums specific to German natural-language-processing, or to libraries with specific German support. (Maybe, NLTK or Spacy?)
But also, knowing word2vec, this sort of patch-up may not actually be that important to your end-goals. Training without this logical-reassembly of the intended full words may still let the fragments achieve useful vectors, and the corresponding full words may achieve useful vectors from other usages. The fragments may wind up close enough to the full compound words that they're "good enough" for whatever your next regression/classifier step does. So if this seems a blocker, don't be afraid to just try ignoring it as a non-problem. (Then if you later find an adequate de-hyphenation approach, you can test whether it really helped or not.)

Heuristics for determining whether something is a "word" or random data?

I am writing a web crawler in python that downloads a list of URLS, extracts all visible text from the HTML, tokenizes the text (using nltk.tokenize) and then creates a positional inverted index of words in each document for use by a search feature.
However, right now, the index contains a bunch of useless entries like:
1) //roarmag.org/2015/08/water-conflict-turkey-middle-east/
2) ———-
3) ykgnwym+ccybj9z1cgzqovrzu9cni0yf7yycim6ttmjqroz3wwuxiseulphetnu2
4) iazl+xcmwzc3da==
Some of these, like #1, are where URLs appear in the text. Some, like #3, are excerpts from PGP keys, or other random data that is embedded in the text.
I am trying to understand how to filter out useless data like this. But I don't just want to keep words that I would find in an English dictionary, but also things like names, places, nonsense words like "Jabberwocky" or "Rumpelstiltskin", acronyms like "TANSTAAFL", obscure technical/scientific terms, etc ...
That is, I'm looking for a way to heuristically strip out strings that are "gibberish": (1) exceedingly "long", (2) filled with a bunch of punctuation, or (3) composed of random strings of characters like afhdkhfadhkjasdhfkldashfkjahsdkfhdsakfhsadhfasdhfadskhkf ... I understand that there is no way to do this with 100% accuracy, but if I could remove even 75% of the junk I'd be happy.
Are there any techniques that I can use to separate "words" from junk data like this?
Excessively long words are trivial to filter. It's pretty easy to filter out URLs, too. I don't know about Python, but other languages have libraries you can use to determine if something is a relative or absolute URL. Or you could just use your "strings with punctuation" filter to filter out anything that contains a slash.
Words are trickier, but you can do a good job with n-gram language models. Basically, you build or obtain a language model, and run each string through the model to determine the likelihood of that string being a word in the particular language. For example, "Rumpelstiltskin" will have a much higher likelihood of being an English word than, say, "xqjzipdg".
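Since the question is about Python, here is a toy character-bigram model to show the mechanics. The training list is a placeholder of a dozen words; a usable model needs a large word list or corpus, and the smoothing constant and function names are my own assumptions.

```python
import math
from collections import Counter

def train(words):
    """Count character bigrams, with ^ and $ marking word boundaries."""
    pairs, firsts = Counter(), Counter()
    for word in words:
        padded = f"^{word.lower()}$"
        for a, b in zip(padded, padded[1:]):
            pairs[a, b] += 1
            firsts[a] += 1
    return pairs, firsts

def score(word, model, alphabet_size=28):
    """Average log-probability per bigram, with add-one smoothing.

    Higher (closer to zero) means more word-like under the model.
    """
    pairs, firsts = model
    padded = f"^{word.lower()}$"
    total = sum(
        math.log((pairs[a, b] + 1) / (firsts[a] + alphabet_size))
        for a, b in zip(padded, padded[1:])
    )
    return total / (len(padded) - 1)

# Placeholder training data; use a real corpus in practice.
model = train(["water", "conflict", "turkey", "middle", "east", "search",
               "index", "words", "names", "places", "letter", "later"])
print(score("letters", model) > score("xqjz", model))
```

You would then pick a cutoff score by looking at a sample of tokens you have labelled by hand as words or junk.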
See https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark for a trained model that might be useful to you in determining if a string is an actual word in some language.
See also NLTK and language detection.

definite vs indefinite article usage corrector

I'm writing a program that corrects 'a/an' vs 'the' article usage. I've been able to detect the plural case (the article is always 'the' when the corresponding noun is plural).
I'm stumped on how to solve this issue for singular nouns. Without context, both "an apple" and "the apple" are correct. How would I approach such cases?
I don't think this is something you will be able to get 100% accuracy on, but it seems to me that one of the most important cues is previous mention. If no apple has been mentioned before, then it is a little odd to say 'the apple'.
A very cheap (and less accurate) approach is to literally check for a token 'apple' in the preceding context and use that as a feature, possibly in conjunction with many other features, such as:
position in text (definiteness becomes likelier as the text progresses)
grammatical function via a dependency parse (grammatical subjects more likely to be definite)
phrase length (definite mentions are typically shorter, fewer adjectives)
etc. etc.
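The cheap previous-mention check could be sketched like this. It is deliberately naive: it only looks for the exact token, and the a/an choice by first letter is a crude assumption that gets words like "hour" or "university" wrong. In a real system the boolean would be one feature among those listed above rather than the whole decision.

```python
import re

def previously_mentioned(noun: str, preceding_text: str) -> bool:
    """True if the noun occurs as a token in the preceding context."""
    tokens = re.findall(r"[a-z]+", preceding_text.lower())
    return noun.lower() in tokens

def suggest_article(noun: str, preceding_text: str) -> str:
    if previously_mentioned(noun, preceding_text):
        return "the"
    # Crude spelling-based rule; pronunciation is what really matters.
    return "an" if noun[0].lower() in "aeiou" else "a"

print(suggest_article("apple", "I bought an apple yesterday."))
```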
A better but more complex approach would be to insert "the" and then use a coreference resolution component to attempt to find a previous mention. Although automatic coreference resolution is not perfect, it is the best way to determine if there is a previous mention using NLP, and most systems will also attempt to resolve non-trivial cases, such as "John has Malaria ... the disease", which a simple string lookup will miss, as well as distinguishing non-co-referring mentions: a red apple ... != a green apple.
Finally, there is a large number of nouns which can appear with a definite article despite not being mentioned previously, including names ("the Olympic Games"), generics ("the common ant"), contextually inferable words ("pass the salt") and uniquely identifiable things ("the sun"). All of these could be learned from a training corpus, but that would probably require a separate classifier.
Hope this helps!

How to measure similarity between two python code blocks?

Many would want to measure code similarity to catch plagiarisms, however my intention is to cluster a set of python code blocks (say answers to the same programming question) into different categories and distinguish different approaches taken by students.
If you have any idea how this could be achieved, I would appreciate it if you share it here.
You can choose any scheme you like that essentially hashes the contents of the code blocks, and place code blocks with identical hashes into the same category.
Of course, what will turn out to be similar will then depend highly on how you defined the hashing function. For instance, a truly stupid hashing function H(code)==0 will put everything in the same bin.
A hard problem is finding a hashing function that classifies code blocks in a way that seems similar in a natural sense. Despite lots of research, nobody has yet found anything better to judge this than "I'll know if they are similar when I see them".
You surely do not want it to be dependent on layout/indentation/whitespace/comments, or slight changes to these will classify blocks differently even if their semantic content is identical.
There are three major schemes people have commonly used to find duplicated (or similar) code:
Metrics-based schemes, which compute the hash from a metric obtained by counting various types of operators and operands. (Note: this uses lexical tokens.) These often operate only at the function level. I know of no practical tools based on this.
Lexically based schemes, which break the input stream into lexemes, convert identifiers and literals into fixed special constants (i.e., treat them as undifferentiated), and then essentially hash N-grams (sequences of N tokens) over these sequences. There are many clone detectors based on essentially this idea; they work tolerably well, but also find stupid matches because nothing forces alignment with program structure boundaries. (The sequence
return ID; } void ID ( int ID ) {
is an 11-gram which occurs frequently in C-like languages but clearly isn't a useful clone.) The result is that false positives tend to occur, i.e., you get claimed matches where there isn't one.
Abstract syntax tree based matching (hashing over subtrees), which automatically aligns clones to language boundaries by virtue of using the ASTs, which represent the language structures directly. (I'm the author of the original paper on this, and built a commercial product, CloneDR, based on the idea; see my bio.) These tools have the advantage that they can match code that contains sequences of tokens of different lengths in the middle of a match, e.g., one statement (of arbitrary size) is replaced by another.
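For the lexically based scheme, a rough Python sketch using the standard `tokenize` module might look like the following. The `ID`/`LIT` placeholders, the 4-gram size and the Jaccard comparison are my own choices for illustration, not what any particular clone detector does.

```python
import io
import keyword
import tokenize

SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
STRUCTURAL = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}

def lexemes(source: str):
    """Token stream with identifiers and literals made undifferentiated."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type in STRUCTURAL:
            out.append(tokenize.tok_name[tok.type])  # ignore indent width
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out

def ngrams(tokens, n=4):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lexical_similarity(a: str, b: str, n: int = 4) -> float:
    """Jaccard overlap of the token n-gram sets of two code blocks."""
    x, y = ngrams(lexemes(a), n), ngrams(lexemes(b), n)
    return len(x & y) / len(x | y) if x | y else 1.0

a = "def add(x, y):\n    return x + y  # sum\n"
b = "def plus(first, second):\n        return first + second\n"
print(lexical_similarity(a, b))
```

Unrelated blocks will still share n-grams such as `) : NEWLINE INDENT`, which is exactly the kind of structurally meaningless match described above.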
This paper provides a survey of the various techniques: http://www.cs.usask.ca/~croy/papers/2009/RCK_SCP_Clones.pdf. It shows that AST-based clone detection tools appear to be the most effective at producing clones that people agree are similar blocks of code, which seems key to OP's particular interest; see Table 14.
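To give a flavour of the AST-based idea in Python itself, here is a very small sketch built on the standard `ast` module. It hashes a whole block after renaming identifiers and constants, which is far cruder than real subtree matching, and the class and function names are invented for this example.

```python
import ast
import hashlib

class Normalizer(ast.NodeTransformer):
    """Replace names and constants so only the structure remains."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="ID", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "ID"
        return node

    def visit_FunctionDef(self, node):
        node.name = "ID"
        self.generic_visit(node)  # normalize arguments and body too
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)

def structure_hash(source: str) -> str:
    """Hash of the normalized AST; comments and layout never reach the AST."""
    tree = Normalizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

a = "def add(x, y):\n    # sum\n    return x + y\n"
b = "def plus(first,second):\n\n    return first+second\n"
print(structure_hash(a) == structure_hash(b))
```

Because every name becomes `ID`, calls to different functions look identical here; to cluster student answers you would probably want to keep built-in and library names, and to hash subtrees rather than whole blocks.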
[There are graph-based schemes that match control and data flow graphs. They should arguably produce even better matches but apparently do not do much better in practice.]
One approach would be to count the number of functions, objects and keywords, possibly grouped into categories such as branching, creating, manipulating, etc., and the number of variables of each type, without relying on the methods and variables being given the same name(s).
For a given problem, similar approaches will tend to come out with similar scores for these, e.g., a student who used a decision tree would have a high number of branch statements, while one who used a decision table would have a much lower number.
This approach would be much quicker to implement than parsing the code structure and comparing the results.
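A quick sketch of such counting with the standard `ast` module follows. The category names and which node types go into each are assumptions for illustration; you would adapt them to the assignment at hand.

```python
import ast
from collections import Counter

# Hypothetical grouping of AST node types into categories.
CATEGORIES = {
    "branching": (ast.If, ast.IfExp),
    "looping": (ast.For, ast.While),
    "functions": (ast.FunctionDef, ast.Lambda),
    "calls": (ast.Call,),
}

def profile(source: str) -> Counter:
    """Count how many nodes of each category a code block contains."""
    counts = Counter({name: 0 for name in CATEGORIES})
    for node in ast.walk(ast.parse(source)):
        for name, node_types in CATEGORIES.items():
            if isinstance(node, node_types):
                counts[name] += 1
    return counts

tree_style = (
    "def grade(s):\n"
    "    if s > 90:\n        return 'A'\n"
    "    elif s > 80:\n        return 'B'\n"
    "    else:\n        return 'C'\n"
)
table_style = (
    "def grade(s):\n"
    "    table = {9: 'A', 8: 'B'}\n"
    "    return table.get(s // 10, 'C')\n"
)
print(profile(tree_style)["branching"], profile(table_style)["branching"])
```

The resulting count vectors can then be fed to any ordinary clustering method.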
