I've got a set of documents in a Whoosh index, and I want to provide a search term suggestion feature. So if you type "pop", some suggestions that could come up might be:
popcorn
popular
pope
Poplar Film
pop culture
I've got the terms that should be coming up as suggestions going into an NGRAMWORDS field in my index, but when I do a query on that field I get autocompleted results rather than the expanded suggestions - so I get documents tagged with "pop culture", but no way to show that term to the user.
(For comparison, I'd do this in ElasticSearch using a completion mapping on that field and then use the _suggest endpoint to get the suggestions.)
I can only find examples for autocomplete or spelling correction in the documentation or elsewhere on the web. Is there any way I can get search term suggestions from my index with Whoosh?
Edit:
expand_prefix was a much-needed pointer in the right direction. I've ended up using a KEYWORD(commas=True, lowercase=True) for my suggest field, and code like this to get suggestions in most-common-first order (expand_prefix and iter_prefix will yield them in alphabetical order):
from operator import itemgetter

def get_suggestions(term):
    with ix.reader() as r:
        suggestions = [(s[0], s[1].doc_frequency()) for s in r.iter_prefix('suggest', term)]
    return sorted(suggestions, key=itemgetter(1), reverse=True)
Term Frequency Functions
I want to add to the answers here that there is actually a built-in function in Whoosh that returns the top 'number' terms in a field by term frequency. It is in the Whoosh docs:
whoosh.reading.IndexReader.most_frequent_terms(fieldname, number=5, prefix='')
tf-idf vs. frequency
Also, on the same page of the docs, right above the previous function, is a function that returns the most distinctive terms rather than the most frequent ones. It uses the tf-idf score, which is effective at eliminating common but insignificant words like 'the'. This could be more or less useful depending on what you are looking for. It is appropriately named:
whoosh.reading.IndexReader.most_distinctive_terms(fieldname, number=5, prefix='')
Each of these would be used in this fashion:
with ix.reader() as r:
    print(r.most_frequent_terms('suggestions', number=5, prefix='pop'))
    print(r.most_distinctive_terms('suggestions', number=5, prefix='pop'))
Multi-Word Suggestions
I have also had problems with multi-word suggestions. My solution was to create a schema in the following way:
schema = fields.Schema(
    suggestions=fields.TEXT(),
    suggestion_phrases=fields.KEYWORD(commas=True, lowercase=True),
)
In the suggestion_phrases field, commas=True allows keywords to be stored with spaces and therefore contain multiple words, and lowercase=True ignores capitalization (this can be removed if it is necessary to distinguish between capitalized and non-capitalized terms). Then, in order to get both single and multi-word suggestions, run either most_frequent_terms() or most_distinctive_terms() on both fields and combine the results, as sketched below.
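A minimal sketch of what that combination might look like, assuming most_frequent_terms() returns (frequency, term) pairs as the docs describe (the helper name and the merge strategy are assumptions):

def get_combined_suggestions(prefix, number=5):
    with ix.reader() as r:
        # Collect (frequency, term) pairs from both the single-word and the
        # multi-word fields while the reader is still open.
        single = list(r.most_frequent_terms('suggestions', number=number, prefix=prefix))
        phrases = list(r.most_frequent_terms('suggestion_phrases', number=number, prefix=prefix))
    # Merge and keep the overall most frequent terms.
    combined = sorted(single + phrases, reverse=True)
    return [term for freq, term in combined[:number]]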
This is not exactly what you are looking for, but it can probably help you:
reader = index.reader()
for term in reader.expand_prefix('title', 'pop'):
    print(term)
Output example:
pop
popcorn
popular
Update
Another workaround is to build a separate index with the keywords in a TEXT field only and play with the query language. Here is what I could achieve:
In [12]: list(ix.searcher().search(qp.parse('pop*')))
Out[12]:
[<Hit {'keywords': u'popcorn'}>,
<Hit {'keywords': u'popular'}>,
<Hit {'keywords': u'pope'}>,
<Hit {'keywords': u'Popular Film'}>,
<Hit {'keywords': u'pop culture'}>]
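A minimal sketch of how such a keyword-only index might be built and queried (the index directory, field name, and the stored=True option are assumptions for illustration):

import os
from whoosh import index
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser

# Keyword-only index; stored=True so the keyword text comes back in the hits.
schema = Schema(keywords=TEXT(stored=True))
os.makedirs('keyword_index', exist_ok=True)
ix = index.create_in('keyword_index', schema)

writer = ix.writer()
for kw in [u'popcorn', u'popular', u'pope', u'Popular Film', u'pop culture']:
    writer.add_document(keywords=kw)
writer.commit()

# The default query parser understands trailing wildcards such as 'pop*'.
qp = QueryParser('keywords', schema=ix.schema)
with ix.searcher() as searcher:
    for hit in searcher.search(qp.parse(u'pop*')):
        print(hit['keywords'])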
Related
I'm working on an NLP project, and right now I'm stuck on detecting antonyms for certain phrases that aren't in their "standard" base forms (as verbs, adjectives, nouns) but instead appear as present participles, past tense, or something to that effect. For instance, if I have the phrase "arriving" or "arrived", I need to convert it to "arrive". Similarly, "came" should be "come". Lastly, "dissatisfied" should be "dissatisfy". Can anyone help me out with this? I have tried several stemmers and lemmatizers in NLTK with Python, to no avail; most of them don't produce the correct root. I've also thought about the ConceptNet semantic network and other dictionary APIs, but it seems far too complicated for what I need. Any advice is helpful. Thanks!
If you know you'll be working with a limited set, you could create a dictionary.
Example:
look_up = {
    'arriving': 'arrive',
    'arrived': 'arrive',
    'came': 'come',
    'dissatisfied': 'dissatisfy',
}

test = 'arrived'
print(look_up[test])
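If the set grows beyond what is practical to enumerate, one hedged option is to fall back on NLTK's WordNetLemmatizer with the verb POS tag for words that are not in the table (a sketch only; the lemmatizer will not cover every irregular form):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def to_base_form(word):
    # Prefer the hand-built table, then try WordNet's verb lemmatization.
    if word in look_up:
        return look_up[word]
    return lemmatizer.lemmatize(word, pos='v')

print(to_base_form('arrived'))  # arrive (from the table)
print(to_base_form('singing'))  # sing (from WordNet)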
From user-given input of a job description, I need to extract the keywords or phrases using Python and its libraries. I am open to suggestions and guidance from the community on which libraries work best, and if it is simple, please guide me through it.
Example of user input:
user_input = "i want a full stack developer. Specialization in python is a must"
Expected output:
keywords = ['full stack developer', 'python']
Well, a good keyword set is a good approach, but the key is how to build it. There are many ways to do it.
First, the simplest one is searching for open keyword sets on the web. It depends on your luck and your knowledge. Your keywords (like "python", "java", "machine learning") are common tags on Stack Overflow and recruitment websites. Don't break the law!
The second one is information extraction (IE), which is more complex than the first. There are many algorithms, like TextRank, entropy, Apriori, HMM, TF-IDF, conditional random fields, and so on.
Good luck.
For matching keywords/phrases, a trie is much faster; a sketch follows.
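A minimal sketch of token-level trie matching, using the keyword list and sentence from the question (the crude tokenization and the find_keywords helper are assumptions):

keys = ['python', 'full stack developer', 'java', 'machine learning']

# Build a trie keyed on tokens; '$' marks the end of a complete keyword.
trie = {}
for key in keys:
    node = trie
    for token in key.split():
        node = node.setdefault(token, {})
    node['$'] = key

def find_keywords(text):
    tokens = text.lower().replace('.', ' ').split()
    found = []
    for start in range(len(tokens)):
        node = trie
        for token in tokens[start:]:
            if token not in node:
                break
            node = node[token]
            if '$' in node:
                found.append(node['$'])
    return found

print(find_keywords("i want a full stack developer. Specialization in python is a must"))
# ['full stack developer', 'python']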
Well, I answered my own question. Thanks anyway to those who replied.
user_input = "i want a full stack developer. Specialization in python is a must"
keys = ['python', 'full stack developer', 'java', 'machine learning']

keywords = []
for word in keys:
    # Keep only the known keys that actually appear in the user's text.
    if word in user_input.lower():
        keywords.append(word)

print(keywords)
Output was as expected!
I have a text in English that I want to process to detect specific entries that I have in another dictionary in Python (example entry: mass spectroscopy). Those entries are very important as they need to be matched for later annotation. In order to do that I need to either add many forms of each entry (like plurals, acronyms, etc.) or find a way to do the matching intelligently. Not only does the brute-force approach take much more time (for me), but I might not be able to resolve all the situations (I want "mass spectroscopy", possibly "spectroscopy", but not "mass"). I am not looking for a solution, I just need guidelines on how to approach the problem and which toolkit to use. The dictionary is growing, and an intelligent approach would be preferred.
I have found NLTK in Python, but I am not sure how to use my dict in addition to or instead of the built-in corpora.
Example - I have a sentence:
[u'Liquid', u'biopsies', u'based', u'on', u'circulating', u'cell-free', u'DNA', u'(cfDNA)', u'analysis', u'are', u'described', u'as', u'surrogate', u'samples', u'for', u'molecular', u'analysis.']
I have a dict with {'Liquid biopsy': ['Blood for analysis'], 'cfDNA': ['Blood for analysis']}. The lists are used on purpose so that both values are the same object, effectively creating aliases in the dict.
How do I match my entries to the text?
Thanks in advance!
If I didn't misunderstand you, you want to check the dictionary keys against the list items and then print the results to the console.
dict_1 = {
    "Liquid Biopsy": "Blood for analysis",
    "cfDNA": "Blood for analysis",
    "Liquid Biopsies": "Blood for analysis",
}
list_1 = [u'Liquid', u'biopsies', u'based', u'on', u'circulating', u'cell-free',
          u'DNA', u'(cfDNA)', u'analysis', u'are', u'described', u'as',
          u'surrogate', u'samples', u'for', u'molecular', u'analysis.']

string_1 = " ".join(list_1).lower()
for key in dict_1:
    if key.lower() in string_1:
        print("Key: {}\nValue: {}\n".format(key, dict_1[key]))
I used the above code and the console printed the results below.
Key: Liquid Biopsies
Value: Blood for analysis
Key: cfDNA
Value: Blood for analysis
Process finished with exit code 0
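For the plural problem raised in the question ("Liquid biopsies" vs. "Liquid biopsy"), a hedged variation is to normalize both the dictionary keys and the text with the same NLTK stemmer before comparing, so you do not have to list every form explicitly (the normalize helper and the punctuation handling are assumptions):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def normalize(text):
    # Lowercase, strip simple punctuation, and stem each token.
    tokens = text.lower().replace('(', ' ').replace(')', ' ').replace('.', ' ').split()
    return ' '.join(stemmer.stem(t) for t in tokens)

entries = {'Liquid biopsy': ['Blood for analysis'], 'cfDNA': ['Blood for analysis']}
sentence = ' '.join([u'Liquid', u'biopsies', u'based', u'on', u'circulating', u'cell-free',
                     u'DNA', u'(cfDNA)', u'analysis', u'are', u'described', u'as',
                     u'surrogate', u'samples', u'for', u'molecular', u'analysis.'])

normalized_sentence = normalize(sentence)
for entry, annotation in entries.items():
    if normalize(entry) in normalized_sentence:
        print(entry, '->', annotation)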
How can you improve the accuracy of search results from Elasticsearch using the Python wrapper? My basic example returns results, but the results are very inaccurate.
I'm running Elasticsearch 5.2 on Ubuntu 16, and I start by creating my index and adding a few documents like:
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Document A
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some specific keywords',
        weight=1.0,
        data='blah1',
    ),
)

# Document B
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other specific keywords',
        weight=1.0,
        data='blah2',
    ),
)

# Document C
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other very long text that is very different yet mentions the word specific and keywords',
        weight=1.0,
        data='blah3',
    ),
)
I then query it with:
es = Elasticsearch()
es.indices.create(index='my-test-index', ignore=400)
query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            "function_score": {
                "query": {
                    "match": {
                        "search_key": query
                    }
                },
                "functions": [{
                    "script_score": {
                        "script": "doc['weight'].value"
                    }
                }],
                "score_mode": "multiply"
            }
        },
    },
)
And although it returns all results, it returns them in the order of documents B, C, A, whereas I would expect them in the order A, B, C, because although all the documents contain all my keywords, only the first one is an exact match. I would expect C to be last because, even though it contains all my keywords, it also contains a lot of fluff I'm not explicitly searching for.
This problem compounds when I index more entries. The search returns everything that has even a single keyword from my query, and seemingly weights them all identically, causing the search results to get less and less accurate as my index grows.
This is making Elasticsearch almost useless. Is there any way I can fix it? Is there a problem with my search() call?
In your query, you can use a match_phrase query instead of a match query so that the order and proximity of the search terms get into the mix. Additionally, you can add a small slop in order to allow the terms to be further apart or in a different order. But documents with terms in the same order and closer together will be ranked higher than documents with terms out of order and/or further apart. Try it out:
"query": {
"match_phrase": {
"search_key": query,
"slop": 10
}
},
Note: slop is a number that indicates how many "swaps" of the search terms you need to perform in order to land on the term configuration present in the document.
Sorry for not reading your question more carefully and for the loaded answer below. I don't want to be a stick in the mud, but I think it will be clearer if you understand a bit more how Elasticsearch itself works.
Because you index your documents without specifying any index and mapping configuration, Elasticsearch uses several defaults that it provides out of the box. The indexing process will first tokenize field values in your document using the standard tokenizer and analyze them using the standard analyzer before storing them in the index. Both the standard tokenizer and analyzer work by splitting your string on word boundaries. So at the end of index time, what you have in your index for the terms in the search_key field are ["some", "specific", "keywords"], not "some specific keywords".
During search time, the match query controls relevance using a similarity algorithm called term frequency/inverse document frequency, or TF/IDF. This algorithm is very popular in text search in general and there is a Wikipedia section on it: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. What's important to note here is that the more frequently your term appears in the index, the less important it is in terms of relevance. some, specific, and keywords appear in ALL 3 documents in your index, so as far as Elasticsearch is concerned, they contribute very little to the documents' relevance in your search result. Since A contains only these terms, it's like having a document containing only the, an, a in an English index. It won't show up as the first result even if you search for the, an, a specifically. B ranks higher than C because B is shorter, which yields a higher norm value. This norm value is explained in the relevance documentation. This is a bit of speculation on my part, but I think you can confirm it works this way by explaining the query using the explain API.
So, back to your need, how to favor exact match over everything else? There is, of course, the match_phrase query as Val pointed out. Another popular method to do it, which I personally prefer, is to index the raw value in a nested field called search_key.raw using the not_analyzed option when defining your mapping: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2 and simply boost this raw value when you search.
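A rough sketch of that last approach on ES 5.x, where a keyword sub-field plays the role of the not_analyzed raw value (the raw sub-field name, the boost value, and the bool/should combination are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Mapping with an analyzed text field plus a raw keyword sub-field.
es.indices.create(
    index='my-test-index',
    body={
        'mappings': {
            'text': {
                'properties': {
                    'search_key': {
                        'type': 'text',
                        'fields': {
                            'raw': {'type': 'keyword'}
                        }
                    }
                }
            }
        }
    },
    ignore=400,
)

# Query both fields, boosting the exact (raw) match above the analyzed one.
query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'bool': {
                'should': [
                    {'match': {'search_key': query}},
                    {'term': {'search_key.raw': {'value': query, 'boost': 10}}},
                ]
            }
        }
    },
)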
I am using Haystack within a project using Solr as the backend. I want to be able to perform a "contains" search, similar to Django's .filter(something__contains="...").
The __startswith option does not suit our needs as, as the name suggests, it looks for words that start with the string.
I tried to use something like *keyword*, but Solr does not allow the * to be used as the first character.
Thanks.
To get "contains" functionality you can use:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="back"/>
<filter class="solr.LowerCaseFilterFactory" />
as the index analyzer.
This will create ngrams for every whitespace separated word in your field. For example:
"Index this!" => x, ex, dex, ndex, index, !, s!, is!, his!, this!
As you can see, this will expand your index greatly, but if you now enter a query like:
"nde*"
it will match "ndex", giving you a hit.
Use this approach carefully to make sure that your index doesn't get too large. If you increase minGramSize or decrease maxGramSize, it will not expand the index as much but will reduce the "contains" functionality. For instance, setting minGramSize="3" will require that you have at least 3 characters in your contains query.
You can achieve the same behavior without having to touch the Solr schema. In your index, make your text field an EdgeNgramField instead of a CharField. Under the hood this will generate a similar schema to what lindstromhenrik suggested; see the sketch below.
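A minimal sketch of such a Haystack search index, assuming a hypothetical Article model with a title field (the model and field names are illustrative):

from haystack import indexes
from myapp.models import Article  # hypothetical model

class ArticleIndex(indexes.SearchIndex, indexes.Indexable):
    # EdgeNgramField makes prefix-style "contains" matching work without schema edits.
    text = indexes.EdgeNgramField(document=True, use_template=True)
    title = indexes.EdgeNgramField(model_attr='title')

    def get_model(self):
        return Article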
I am using an expression like:
.filter(something__startswith='...')
.filter_or(name=''+s'...')
as it seems Solr does not like expressions like '...*', but combined with OR it will do.
None of the answers here do a real substring search like *keyword*.
They don't find a keyword that is part of a bigger string (not a prefix or suffix).
Using EdgeNGramFilterFactory or the EdgeNgramField in the indexes can only do a "startswith" or an "endswith" type of filtering.
The solution is to use a NgramField like this:
class MyIndex(indexes.SearchIndex, indexes.Indexable):
    ...
    field_to_index = indexes.NgramField(model_attr='field_name')
    ...
This is very elegant, because you don't need to manually add anything to the schema.xml.