I am wondering if there is a Python library that can do fuzzy text search. For example:
I have three keywords "letter", "stamp", and "mail".
I would like a function that checks whether those three words occur within
the same paragraph (or within a certain distance, say one page).
In addition, the words have to appear in the same order. It is fine if other words appear between them.
I have tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I did not find the proper function.
1. You can do this in Whoosh 2.7, which supports fuzzy search once you add the plugin whoosh.qparser.FuzzyTermPlugin:
whoosh.qparser.FuzzyTermPlugin lets you search for “fuzzy” terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of “edits” (character insertions, deletions, and/or transpositions – this is called the “Damerau-Levenshtein edit distance”).
To add the fuzzy plugin:
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())
Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by adding a ~ followed by an optional maximum edit distance. If you don’t specify an edit distance, the default is 1.
For example, the following "fuzzy" term queries:
letter~     (terms within 1 edit of "letter")
letter~2    (terms within 2 edits)
letter~2/3  (terms within 2 edits, with the first 3 characters required to match exactly)
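As a minimal sketch, reusing the parser and my_index from above, parsing and running such a query would look like:
query = parser.parse(u"letter~2")  # matches any term within 2 edits of "letter"
with my_index.searcher() as searcher:
    results = searcher.search(query)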
2. To keep the words in order, use the query class whoosh.query.Phrase, but replace the default phrase plugin with whoosh.qparser.SequencePlugin, which allows you to use fuzzy terms inside a phrase:
"letter~ stamp~ mail~"
To replace the default phrase plugin with the sequence plugin:
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())
3. To allow other words in between, set the slop argument of your Phrase query to a larger number:
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
You can also define the slop directly in the query syntax, like this:
"letter~ stamp~ mail~"~10
4. Overall solution:
4.a. The indexer would look like this:
import os
from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(title=TEXT(stored=True), content=TEXT)
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")  # create_in requires the directory to exist
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()
4.b. The searcher would look like this:
from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin
with ix.searcher() as searcher:
    parser = QueryParser(u"content", ix.schema)
    parser.add_plugin(FuzzyTermPlugin())
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
    results = searcher.search(query)
    print "nb of results =", len(results)
    for r in results:
        print r
That gives the result:
nb of results = 2
<Hit {'title': u'Sixth document'}>
<Hit {'title': u'Third document'}>
5. If you want fuzzy search to be the default, without appending ~n to each word of the query, you can initialize QueryParser like this:
from whoosh.query import FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass = FuzzyTerm)
Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist = 1. Subclass it if you want a bigger edit distance:
class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
        # in Python 3 you can simply write super().__init__(...)
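Then pass the subclass to the parser. As a short sketch (assuming the same SequencePlugin setup as in the searcher above):
parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)
parser.remove_plugin_class(PhrasePlugin)
parser.add_plugin(SequencePlugin())
query = parser.parse(u"\"letter stamp mail\"~10")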
References:
whoosh.query.Phrase
Adding fuzzy term queries
Allowing complex phrase queries
class whoosh.query.FuzzyTerm
qparser module
I am using the Eldar (https://github.com/kerighan/eldar) package to validate pieces of text against a boolean search string. In doing this, I also want to validate word combinations. Here is an example (I deliberately use match_word=True, and I would like the solution to my problem to keep it):
eldar = Query('("fiscale economie")', ignore_case=True, ignore_accent=True, match_word=True)
print(eldar("fiscale economie"))
The result of this is False, because Eldar doesn't seem to recognise the space between "fiscale" and "economie". Is there a way for Eldar to validate word combinations while match_word remains True?
You can use the AND operator in a query. The downside of this method is that the query will match fiscale and economie regardless of the positions they take in the string. All documents match the query in the example below:
from eldar import Query
documents = [
    "economie fiscale",
    "fiscale something economie",
    "economie something fiscale",
    "some fiscale economie thing",
    "fiscale economie"
]
eldar = Query('"fiscale" AND "economie"', ignore_case=True, ignore_accent=True, match_word=True)
print(eldar.filter(documents))
Unfortunately, a wildcard won't work here due to its implementation.
Another way is to create an Index. Searching for "fiscale economie" returns the last two documents as expected:
from eldar import Index
documents = [
    "economie fiscale",
    "fiscale something economie",
    "economie something fiscale",
    "some fiscale economie thing",
    "fiscale economie"
]
index = Index(ignore_case=True, ignore_accent=True)
index.build(documents)
index.save("index.p")
index = Index.load("index.p")
print(index.search('fiscale economie'))
I have a large set of long text documents with punctuation. Three short examples are provided here:
doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]
and I have sets of words like the following:
wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])
I want to search all text documents that meet the following condition:
(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)
The or not condition in each parenthesis is needed because the three lists could be empty. Please note that before applying the condition I also need to clean the text of punctuation, transform it to lowercase, and split it into a set of words, which takes additional time.
This process would match the first text in doc but not the second and the third. Indeed, the second would not match because it contains the word "dogs", and the third because it does not include the word "seaside".
I am wondering whether this general problem (with the words in the wOR, wAND and wNOT lists changing) can be solved in a faster way, avoiding the text pre-processing/cleaning step. Maybe with a fast regex solution, perhaps one that uses a Trie(). Is that possible, or do you have any other suggestion?
Your solution appears to be linear in the length of the document; you won't be able to do much better than that, as the words you're looking for could be anywhere in the document. You could try using one loop over the entire doc (wrapped in a function here so that return works, and copying wAND so the original set isn't consumed):
def check(doc_words, wAND, wOR, wNOT):
    wAND = set(wAND)  # copy, so the caller's set is not mutated
    or_satisfied = not wOR  # an empty wOR imposes no constraint
    for word in doc_words:
        if word in wAND: wAND.remove(word)
        if not or_satisfied and word in wOR: or_satisfied = True
        if word in wNOT: return False
    return or_satisfied and not wAND
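A hypothetical usage with the example data from the question; the cleaning shown here is just one possible way to lower-case, split, and strip punctuation:
import string

doc_words = [w.strip(string.punctuation) for w in doc[0].lower().split()]
print(check(doc_words, wAND, wOR, wNOT))  # expected: True for the first example document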
You can build regexps for the word bags you have, and use them:
import re

def make_re(word_set):
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )

wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)

def re_match(doc):
    if not wOR_re.search(doc):
        return False
    if wNOT_re.search(doc):
        return False
    found = set()
    expected = len(wAND)
    for match in wAND_re.finditer(doc):
        found.add(match.group().lower())
        if len(found) == expected:
            break
    return len(found) == expected
A quick time test suggests this is about 89% faster than the original (and passes the original "test suite"), likely due to the fact that:
documents don't need to be cleaned (the \b anchors limit matches to whole words and re.I takes care of case normalization)
the regexps run in native code, which tends to be faster than pure Python
name='original' iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match' iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match' iters=10000 time=0.203 iters_per_sec=49363.58
where bag_match is my original comment suggestion of using set intersections:
def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )
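clean_doc here stands for the cleaning step from the question and isn't shown in the answer; a minimal sketch of such a helper (the name and the exact normalization are assumptions) could be:
import string

def clean_doc(doc):
    # lowercase, split on whitespace, and strip surrounding punctuation from each word
    return [w.strip(string.punctuation) for w in doc.lower().split()]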
If you have already cleaned the documents into an iterable of words (here I just slapped @lru_cache on clean_doc, which you probably wouldn't do in real life since your documents are likely to all be unique and caching wouldn't help), then bag_match is much faster:
name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60
I have a list of queries and a list of documents like this:
queries = ['drug dosage form development Society', 'new drugs through activity evaluation of some medicinally used plants', ' Evaluation of drugs available on market for their quality, effectiveness']
docs = ['A Comparison of Urinalysis Technologies for Drugs Testing in Criminal Justice', 'development society and health care', 'buying drugs on the market without prescription may results in death', 'plants and their contribution in pharmacology', 'health care in developing countries']
I want to print a document as a related one if at least one word occurs in both the query and the document. I tried the code below, based on one answer to the "python: finding substring within a list" post, but it did not work:
query = [subquery for subquery in queries]
for i in query:
    sub = i
    for doc in docs:
        if str(i) in docs:
            print docs
Any help is appreciated.
Your code (for i in query:) is searching for the whole sentence, not individual words.
To search for words, you first have to split each query sentence into words:
for q in queries:
    for word in q.strip().split(" "):
        print word
Complete code:
for q in queries:
    for word in q.strip().split(" "):
        for doc in docs:
            if word in doc:
                print doc
Note: the above code will also match common words such as in, for, of, on, etc. in doc; a small stop-word filter, sketched below, avoids that.
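A minimal sketch of that filter (the stop-word list here is just an illustrative assumption):
stopwords = set(["in", "for", "of", "on", "the", "and", "a"])

for q in queries:
    for word in q.strip().split(" "):
        if word.lower() in stopwords:
            continue  # skip words that carry no meaning for the match
        for doc in docs:
            if word in doc:
                print doc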
An efficient way of doing this would be to build an inverted index. The one I've implemented below is a quick-and-dirty inverted index:
words = {}
for index, doc in enumerate(docs):
    for word in doc.split(" "):
        if not word or word == " ":
            pass
        elif word not in words:
            words[word] = [index]
        elif index not in words[word]:
            words[word].append(index)

for query in queries:
    matches = []
    map(lambda x: matches.extend(words[x]), filter(lambda x: x in query, words))
    print list(set(matches))
In an ideal world, your code would also include:
Stopwords - words that shouldn't be indexed, such as "for" or "the", removed from the documents.
Stemming - mapping a word to its stem, allowing for alternate grammatical forms. For instance, running --> run, runs --> run, runner --> run. Thus, searching for any of those terms would bring up documents that contain the word run in all its forms.
Synonyms - look up synonyms in WordNet or similar databases. E.g. vehicle would also bring up documents containing the word "car".
Relevance ranking - retrieved documents can be ranked by the frequency of the search term relative to the total number of words in the document.
All of the above can be added as additional modules to the index and the search engine you're creating, as needed; the stemming step, for example, could look like the sketch below.
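A hedged sketch of folding stemming into the index build, using NLTK's PorterStemmer (this is an assumption about how you might wire it in, not part of the original code):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = {}
for index, doc in enumerate(docs):
    for word in doc.lower().split(" "):
        stem = stemmer.stem(word)  # e.g. "plants" -> "plant", "developing" -> "develop"
        if not stem:
            continue
        words.setdefault(stem, [])
        if index not in words[stem]:
            words[stem].append(index)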
How do I match string items in a list to a location in a larger string, particularly if those string items were derived from the larger string?
I currently receive my output from AlchemyAPI in this format, where each list = [Name of Entity, Count of times the Entity appears in the text, Entity Type, Entity Sentiment]:
[['Social Entrepreneurship', u'25', u'Organization', u'0.854779'],
 ['Skoll Centre for Social Entrepreneurship', u'6', u'Organization', u'0.552907'],
However, in order to evaluate the accuracy of this NER output, I'd like to map the entity types from the AlchemyAPI output onto text I already have. So, for instance, if my text is the following (this is also the text I used to get my AlchemyAPI output):
If
Social
Entrepreneurship
acts
like
This
Social
Entrepreneurship
I'd like the fact that Social Entrepreneurship is mentioned 25 times as an ORG to be applied to my text, so this would be a snippet of those 25 occurrences:
If
Social ORG
Entrepreneurship ORG
acts
like
This
Social ORG
Entrepreneurship ORG
I would go about this by using a tokenizer on both the text that you're sending to the API and the returned entities, then looking for matches. NLTK provides that functionality out of the box with its comprehensive word_tokenize method (http://www.nltk.org/book/ch03.html), though any tokenizer will work as long as it tokenizes the entities the same way as the text (e.g. raw.split()).
# Generic tokenizer (if you don't use NLTK's)
def word_tokenize(raw):
    return raw.split()
With that, you would iterate over each word (token) in the document, checking whether it matches one of the tokens of a returned entity:
for word in word_tokenize(raw):
    for entity in entity_results:
        if word.upper() in (e.upper() for e in word_tokenize(entity[0])):
            print(" ".join([word] + entity[1:]))
        else:
            print(word)
You may want to expand on this to get an exact match for the full entity, testing the length of the token list and checking each element by index instead:
words = word_tokenize(raw)
ents = [word_tokenize(entity[0]) for entity in entity_results]

for word_idx in range(len(words)):
    for ent in ents:
        # Check the word against the first word in the entity
        if words[word_idx].upper() == ent[0].upper():
            match = True
            # Check all words in the entity
            for ent_idx in range(len(ent)):
                if ent[ent_idx] != words[word_idx + ent_idx]:
                    match = False
                    break
            if match:
                print(" ".join([words[word_idx]] + ent))
            else:
                print(words[word_idx])
        else:
            print(words[word_idx])
You may notice, though, that this prints the full entity when it matches, that it only keys off the first word, and that it doesn't handle IndexError problems if the line ent[ent_idx] != words[word_idx + ent_idx] references an invalid index. Some work is needed, depending on what you want to do with the output.
Finally, this all assumes that AlchemyAPI isn't including co-references in their final count. Co-reference is when you refer to an entity using "he", "she", "it", "they", etc. That's something you'll have to test on your own.
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up Python, and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10-syllable line with a 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with a 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...

def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
        #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))

    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lose the line breaks during tokenization, and I really need the line breaks to be able to identify the form. This should not be too hard to deal with, though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllable count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word - '1' - when there is not - '0'. Which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101.
So how would I add some fudge factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
Split the corpus into lines.
For each line: find the syllable count and stress pattern.
Classify the stress patterns.
You'll find that the first step is a single function call in Python:
corpus.split("\n")
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia you might be able to convince the CS faculty to lend you a postgrad for a couple of months to help you, perhaps in place of some workshop requirement.
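A rough skeleton of that decomposition (the function names are placeholders, not working code; corpus is the raw poem text from the question):
def stress_pattern(line):
    # to be implemented with cmudict: return the line's stress string, e.g. "0101010101"
    pass

def classify(pattern):
    # to be implemented: compare the pattern against known forms
    pass

for line in corpus.split("\n"):
    classify(stress_pattern(line))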
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying the input corpus when it doesn't exactly match the pattern: you might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look at (a small sketch of comparing stress strings follows below):
Levenshtein distance: gives you a measure of how different two strings are, as the number of edits needed to turn one string into the other. Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement, but it gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or final-year undergrad should not have too much trouble with it (hint hint).
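For instance, the standard library's difflib gives a quick normalized similarity between an observed stress string and a target pattern (this ratio is based on matching blocks, not Jaro-Winkler, so treat it only as an illustration):
import difflib

def similarity(observed, target):
    # returns a value between 0.0 (completely different) and 1.0 (identical)
    return difflib.SequenceMatcher(None, observed, target).ratio()

print similarity("1101111101", "0101010101")  # accept as iambic pentameter above some threshold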
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmudict parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 when nltk returns 1 (it looks like nltk already returns 0 for some words that would never get stressed, like "the"). So you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels, as in the sketch below) and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, a trailing e is silent"...
I've never worked with other kinds of fuzzy matching, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
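A crude sketch of that vowel-counting fallback (purely an illustration, not a linguistically sound syllabifier):
def rough_syllables(word):
    # count groups of consecutive vowels as syllables
    vowels = "aeiouy"
    count = 0
    previous_was_vowel = False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    if word.lower().endswith("e") and count > 1:
        count -= 1  # rough "trailing e is silent" rule
    return count

print rough_syllables("corinna")  # 3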
This is my first post on stackoverflow, and I'm a Python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems, and the code included in this question helped me, so I post what I came up with building on that foundation. It is one way to extract the stress as a single string, correct the cmudict bias toward 1 with a 'fudging factor', and not lose words that are not in cmudict.
import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                # print strip_letters(s), word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws", ws
        for ch in list(ws):
            #print "ch", ch
            if ch.isdigit():
                nm = nm + ch
                #print "ad to nm", nm, type(nm)
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}