How can you improve the accuracy of search results from Elasticsearch using the Python wrapper? My basic example returns results, but they are very inaccurate.
I'm running Elasticsearch 5.2 on Ubuntu 16, and I start by creating my index and adding a few documents like:
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Document A
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some specific keywords',
        weight=1.0,
        data='blah1',
    ),
)

# Document B
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other specific keywords',
        weight=1.0,
        data='blah2',
    ),
)

# Document C
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other very long text that is very different yet mentions the word specific and keywords',
        weight=1.0,
        data='blah3',
    ),
)
I then query it with:
es = Elasticsearch()
es.indices.create(index='my-test-index', ignore=400)

query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'function_score': {
                'query': {
                    'match': {
                        'search_key': query,
                    },
                },
                'functions': [{
                    'script_score': {
                        'script': "doc['weight'].value",
                    },
                }],
                'score_mode': 'multiply',
            },
        },
    },
)
It returns all the results, but in the order B, C, A, whereas I would expect A, B, C: although all the documents contain all my keywords, only the first one is an exact match. I would expect C to be last because, even though it contains all my keywords, it also contains a lot of fluff I'm not explicitly searching for.
This problem compounds when I index more entries. The search returns everything that has even a single keyword from my query, and seemingly weights them all identically, so the results get less and less accurate the larger my index grows.
This is making Elasticsearch almost useless. Is there any way I can fix it? Is there a problem with my search() call?
In your query, you can use a match_phrase query instead of a match query so that the order and proximity of the search terms factor into the score. Additionally, you can add a small slop to allow the terms to be further apart or in a different order, but documents with terms in the same order and closer together will be ranked higher than documents with terms out of order and/or further apart. Try it out:
"query": {
"match_phrase": {
"search_key": query,
"slop": 10
}
},
Note: slop is a number that indicates how many "swaps" of the search terms you need to perform in order to land on the term configuration present in the document.
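Plugged into the Python client from the question, the whole call might look like this (a sketch that reuses the index and field names above):

from elasticsearch import Elasticsearch

es = Elasticsearch()
query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'match_phrase': {
                'search_key': {
                    'query': query,
                    'slop': 10,
                },
            },
        },
    },
)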
Sorry for not reading your question more carefully and for the loaded answer below. I don't want to be a stick in the mud, but I think it will be clearer if you understand a bit more about how Elasticsearch itself works.
Because you index your documents without specifying any index or mapping configuration, Elasticsearch uses several defaults that it provides out of the box. The indexing process first tokenizes the field values in your document using the standard tokenizer and analyzes them using the standard analyzer before storing them in the index. Both the standard tokenizer and the standard analyzer work by splitting your string on word boundaries. So at the end of index time, what you have in your index for the search_key field are the terms ["some", "specific", "keywords"], not "some specific keywords".
During search time, the match query controls relevance using a similarity algorithm called term frequency/inverse document frequency, or TF/IDF. This algorithm is very popular in text search in general, and there is a Wikipedia article on it: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. What's important to note here is that the more frequently a term appears in the index, the less it contributes to relevance. some, specific, and keywords appear in ALL 3 documents in your index, so as far as Elasticsearch is concerned, they contribute very little to a document's relevance in your search results. Since A contains only these terms, it's like having a document containing only the, an, a in an English index: it won't show up as the first result even if you search for the, an, a specifically. B ranks higher than C because B is shorter, which yields a higher norm value. This norm value is explained in the relevance documentation. This is a bit of speculation on my part, but I think it does work out this way if you explain the query using the explain API.
So, back to your need: how do you favor exact matches over everything else? There is, of course, the match_phrase query as Val pointed out. Another popular method, which I personally prefer, is to index the raw value in a sub-field such as search_key.raw using the not_analyzed option when defining your mapping: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2, and simply boost this raw value when you search.
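The answer doesn't include code for this, but a rough sketch of the idea could look like the following. It assumes ES 5.x (which the question uses), where the not_analyzed string type has been replaced by the keyword type, so the raw value is modelled as a keyword sub-field; the mapping type 'text' matches the doc_type used in the question, and the boost value is arbitrary:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Mapping with an analyzed 'search_key' plus a raw keyword sub-field 'search_key.raw'.
es.indices.create(
    index='my-test-index',
    body={
        'mappings': {
            'text': {
                'properties': {
                    'search_key': {
                        'type': 'text',
                        'fields': {'raw': {'type': 'keyword'}},
                    },
                },
            },
        },
    },
    ignore=400,
)

# Boost the exact (raw) value so exact matches rank above partial ones.
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'bool': {
                'should': [
                    {'match': {'search_key': 'some specific keywords'}},
                    {'term': {'search_key.raw': {'value': 'some specific keywords', 'boost': 5}}},
                ],
            },
        },
    },
)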
Related
I'm very new to Elasticsearch queries, and I'm hoping to get some clarification on the following query from an existing codebase that I'm looking at.
body["query"]["bool"]["should"] = [
{"match": {"categories": {"query": raw_query, "operator": "and"}}},
....,
....,
{"match": {"all": {"query": raw_query, "minimum_should_match": "50%" if keywords else "2<80%"}}}
]
body["query"]["bool"]["minimum_should_match"] = 1
My understanding of minimum_should_match is that it specifies the minimum number of words from the query that must match. For example, in the following, any 2 of the 3 words young, transformation, and Egyptian count as a match for the description field.
"query":{
"match":{
"description":{
"query" : "young transformation Egyptian",
"minimum_should_match" : 2
}
}
}
From the source code I understand that "minimum_should_match": "50%" means that, when there are keywords, it is enough for half of the words in raw_query to match what is in the field all. What confuses me a bit is 2<80%. I've read the docs, but I'm still confused.
From the docs, https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html, it gave an example of 3<90% and says:
if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required.
What exactly is a clause? I would think a clause is each match statement in this case, but in the source code this is placed within a single match clause. In that case, how can there ever be more than one clause? My understanding is obviously incorrect.
The last part I need confirmation on is:
body["query"]["bool"]["minimum_should_match"] = 1
Since it is placed outside of should, does that mean only a single match from body["query"]["bool"]["should"] is required?
My understanding of minimum_should_match is that it specifies the number of minimum matching words to the query.
Not always. Here minimum_should_match applies not to a specific full-text query but to the bool query, and it controls how many of the should clauses have to match.
If minimum_should_match is applied to a specific match query, then yes, it controls how many tokens (words) of the query text have to match.
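To make the difference concrete, here is a small sketch (the field names title and description are only for illustration) showing both placements in one query body:

body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "young transformation Egyptian"}}},
                {"match": {
                    "description": {
                        "query": "young transformation Egyptian",
                        # counts matching tokens within this single match query
                        "minimum_should_match": 2,
                    },
                }},
            ],
            # counts matching should clauses: at least one of the two above must match
            "minimum_should_match": 1,
        },
    },
}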
I am new to this and need some help extracting records/rows only when a few columns are blank. The code below ignores the blank records and gets me only the ones with a value. Can someone suggest something here?
mongo_docs = mongo.db.user.find({"$and":[{"Param1": {"$ne":None}}, {"Param1": {"$ne": ""}}]})
The query you are using contains $ne, which stands for not-equal. You can change that to $eq (equals), combine the two conditions with $or instead of $and, and check whether you get the desired results.
mongo_docs = mongo.db.user.find({"$or":[{"Param1": {"$eq":None}}, {"Param1": {"$eq": ""}}]})
The above can also be simplified with $in, which matches the field against either value:
mongo_docs = mongo.db.user.find({"Param1": {"$in": [None, ""]}})
You can also use the exists query if that better satisfies your requirement.
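For example, a sketch combining $exists with the emptiness check (it also catches documents where Param1 is missing entirely, not just set to null or an empty string):

mongo_docs = mongo.db.user.find({
    "$or": [
        {"Param1": {"$exists": False}},   # field not present at all
        {"Param1": {"$in": [None, ""]}},  # field present but null or empty
    ]
})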
From the comments: You are absolutely right, the command will work with multiple parameters.
mongo_docs = mongo.db.user.find({"$or":[{"Param1": {"$eq":None, "$eq": ""}}, {"Param2": {"$eq":None, "$eq": ""}},{"Param3": {"$eq":None, "$eq": ""}}]})
Additionally, if you want to do this over a larger set of fields, you can consider using text indexes. These allow you to search all the text fields at once; with PyMongo the code should look something like this:
mongo.db.user.create_index([("Param1", "text"), ("Param2", "text"), ("Param3", "text")])
mongo.db.user.find({"$text": {"$search": ""}})
The above works for text content only. I have not come across an equivalent for integer values yet, but I cannot see why the same should not work, with some minor changes, using other index types such as wildcard indexes.
References:
Search multiple fields for multiple values in MongoDB
https://docs.mongodb.com/manual/text-search/
https://docs.mongodb.com/manual/core/index-text/#std-label-text-index-compound
https://docs.mongodb.com/manual/core/index-wildcard/
I have a web app where I store some data in Mongo, and I need to return a paginated response from a find or an aggregation pipeline. I use Django Rest Framework and its pagination, which in the end just slices the Cursor object. This works seamlessly for Cursors, but aggregation returns a CommandCursor, which does not implement __getitem__().
cursor = collection.find({})
cursor[10:20] # works, no problem
command_cursor = collection.aggregate([{'$match': {}}])
command_cursor[10:20] # throws not subscriptable error
What is the reason behind this? Does anybody have an implementation for CommandCursor.__getitem__()? Is it feasible at all?
I would like to find a way to avoid fetching all the values when I need just a page. Converting to a list and then slicing it is not feasible for large (100k+ docs) pipeline results. There is a workaround based on this answer, but it only works for the first few pages, and the performance drops rapidly for pages near the end.
Mongo has aggregation pipeline stages to deal with exactly this, namely $skip and $limit, which you can use like so:
aggregation_results = list(collection.aggregate([{'$match': {}}, {'$skip': 10}, {'$limit': 10}]))
Specifically, as you noticed, PyMongo's CommandCursor does not implement __getitem__, hence the slicing syntax does not work as expected. I would personally recommend not tampering with their code unless you're interested in becoming a contributor to their package.
The MongoDB cursors for find and for aggregate work differently: the cursor from an aggregation query is the result of processed data (in most cases), which is not the case for find cursors, which are static and can therefore have documents skipped and limited at will.
You can add the paginator limits as $skip and $limit stages in the aggregation pipeline.
For Example:
command_cursor = collection.aggregate([
    {
        "$match": {
            # Match Conditions
        }
    },
    {
        "$skip": 10   # No. of documents to skip (should be 0 for page 1)
    },
    {
        "$limit": 10  # No. of documents to be displayed on your webpage
    }
])
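If the page number comes from your paginator, the two stages can be computed from it; a small sketch (the helper name and signature are mine, not part of PyMongo):

def paginate(collection, match, page, page_size):
    # Build the pipeline so that only the requested page is fetched from MongoDB.
    return collection.aggregate([
        {"$match": match},
        {"$skip": (page - 1) * page_size},  # page is 1-based
        {"$limit": page_size},
    ])

# e.g. the third page of 10 documents each:
command_cursor = paginate(collection, {}, page=3, page_size=10)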
I've got a set of documents in a Whoosh index, and I want to provide a search term suggestion feature. So if you type "pop", some suggestions that could come up might be:
popcorn
popular
pope
Poplar Film
pop culture
I've got the terms that should be coming up as suggestions going into an NGRAMWORDS field in my index, but when I do a query on that field I get autocompleted results rather than the expanded suggestions - so I get documents tagged with "pop culture", but no way to show that term to the user.
(For comparison, I'd do this in ElasticSearch using a completion mapping on that field and then use the _suggest endpoint to get the suggestions.)
I can only find examples for autocomplete or spelling correction in the documentation or elsewhere on the web. Is there any way I can get search term suggestions from my index with Whoosh?
Edit:
expand_prefix was a much-needed pointer in the right direction. I've ended up using a KEYWORD(commas=True, lowercase=True) for my suggest field, and code like this to get suggestions in most-common-first order (expand_prefix and iter_prefix will yield them in alphabetical order):
from operator import itemgetter

def get_suggestions(term):
    with ix.reader() as r:
        suggestions = [(s[0], s[1].doc_frequency()) for s in r.iter_prefix('suggest', term)]
    return sorted(suggestions, key=itemgetter(1), reverse=True)
Term Frequency Functions
I want to add to the answers here that there is actually a built-in function in Whoosh that returns the top number terms of a field by term frequency. It is in the Whoosh docs:
whoosh.reading.IndexReader.most_frequent_terms(fieldname, number=5, prefix='')
tf-idf vs. frequency
Also, on the same page of the docs, right above the previous function, there is a function that returns the most distinctive terms rather than the most frequent ones. It uses the tf-idf score, which is effective at eliminating common but insignificant words like 'the'. This could be more or less useful depending on what you are looking for. It is appropriately named:
whoosh.reading.IndexReader.most_distinctive_terms(fieldname, number=5, prefix='')
Each of these would be used in this fashion:
with ix.reader() as r:
    print r.most_frequent_terms('suggestions', number=5, prefix='pop')
    print r.most_distinctive_terms('suggestions', number=5, prefix='pop')
Multi-Word Suggestions
As well, I have had problems with multi-word suggestions. My solution was to create a schema in the following way:
from whoosh import fields

schema = fields.Schema(suggestions=fields.TEXT(),
                       suggestion_phrases=fields.KEYWORD(commas=True, lowercase=True))
In the suggestion_phrases field, commas=True allows keywords to be stored with spaces and therefore have multiple words, and lowercase=True ignores capitalization (this can be removed if it is necessary to distinguish between capitalized and non-capitalized terms). Then, to get both single-word and multi-word suggestions, you would run either most_frequent_terms() or most_distinctive_terms() on both fields and combine the results, as in the sketch below.
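For instance, a sketch of that combination (the helper is mine; it assumes most_frequent_terms yields (frequency, term) pairs, as the Whoosh docs describe):

def combined_suggestions(ix, prefix, limit=5):
    with ix.reader() as r:
        single = list(r.most_frequent_terms('suggestions', number=limit, prefix=prefix))
        phrases = list(r.most_frequent_terms('suggestion_phrases', number=limit, prefix=prefix))
    # Merge both result sets and keep the overall most frequent terms.
    merged = sorted(single + phrases, reverse=True)[:limit]
    return [term for freq, term in merged]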
This is not what you are looking for exactly, but probably can help you:
reader = index.reader()
for x in reader.expand_prefix('title', 'pop'):
    print x
Output example:
pop
popcorn
popular
Update
Another workaround is to build a separate index with the keywords as a TEXT field only and play with the query language. Here is what I could achieve (a rough sketch of the setup follows the output):
In [12]: list(ix.searcher().search(qp.parse('pop*')))
Out[12]:
[<Hit {'keywords': u'popcorn'}>,
<Hit {'keywords': u'popular'}>,
<Hit {'keywords': u'pope'}>,
<Hit {'keywords': u'Popular Film'}>,
<Hit {'keywords': u'pop culture'}>]
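In case it helps, a sketch of that separate keywords-only index (the directory and field names are mine):

import os
from whoosh import fields, index
from whoosh.qparser import QueryParser

# Keywords-only index; TEXT is analyzed, stored=True keeps the original value for display.
if not os.path.exists('suggest_index'):
    os.mkdir('suggest_index')
schema = fields.Schema(keywords=fields.TEXT(stored=True))
ix = index.create_in('suggest_index', schema)

writer = ix.writer()
for kw in [u'popcorn', u'popular', u'pope', u'Popular Film', u'pop culture']:
    writer.add_document(keywords=kw)
writer.commit()

# Wildcard query over the analyzed keywords, as in the output above.
qp = QueryParser('keywords', schema=ix.schema)
hits = list(ix.searcher().search(qp.parse(u'pop*')))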
I read some questions related to mine, like Same sparql not returning same results, but I think this one is a little different.
Consider this query, which I submit to http://live.dbpedia.org/sparql (a Virtuoso endpoint) and get 34 triples as a result:
SELECT ?pred ?obj
WHERE {
  <http://dbpedia.org/resource/Johann_Sebastian_Bach> ?pred ?obj
  FILTER((langMatches(lang(?obj), "")) ||
         (langMatches(lang(?obj), "EN")))
}
Then, I used the same query in Python:
import rdflib
import rdfextras

rdfextras.registerplugins()

g = rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")

PREFIX = """
PREFIX dbp: <http://dbpedia.org/resource/>
"""
query = """
SELECT ?pred ?obj
WHERE {
  dbp:Johann_Sebastian_Bach ?pred ?obj
  FILTER((langMatches(lang(?obj), "")) ||
         (langMatches(lang(?obj), "EN")))
}
"""
query = PREFIX + query
result_set = g.query(query)
print len(result_set)
This time, I get only 27 triples! https://dl.dropboxusercontent.com/u/22943656/result.txt
I thought it could be related to the DBpedia site. I repeated these queries several times and always got the same difference. Therefore, I downloaded the RDF file to test it locally and used Protégé to simulate the Virtuoso endpoint. Even so, I still got different results from the SPARQL submitted in Protégé and in Python: 31 and 27. Is there any explanation for this difference? And how can I get the same result in both?
As the question is written, there are a few possible problems. Based on the comments, the first one described here (about lang, langMatches, etc.) seems to be what you're actually running into, but I'll leave the descriptions of the other possible problems, in case someone else finds them useful.
lang, langMatches, and the empty string
lang is defined to return "" for literals with no language tags. According to RFC 4647 §2.1, language tags are defined as follows:
2.1. Basic Language Range
A "basic language range" has the same syntax as an [RFC3066]
language tag or is the single character "*". The basic language
range was originally described by HTTP/1.1 [RFC2616] and later
[RFC3066]. It is defined by the following ABNF [RFC4234]:
language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
alphanum = ALPHA / DIGIT
This means that "" isn't actually a legal language tag. As Jeen Broekstra pointed out on answers.semanticweb.com, the SPARQL recommendation says:
17.2 Filter Evaluation
SPARQL provides a subset of the functions and operators defined by
XQuery Operator Mapping. XQuery 1.0 section 2.2.3 Expression
Processing describes the invocation of XPath functions. The following
rules accommodate the differences in the data and execution models
between XQuery and SPARQL: …
Functions invoked with an
argument of the wrong type will produce a type error. Effective
boolean value arguments (labeled "xsd:boolean (EBV)" in the operator
mapping table below), are coerced to xsd:boolean using the EBV rules
in section 17.2.2.
Since "" isn't a legal language tag, it might be considered "an argument of the wrong type [that] will produce a type error." In that case, the langMatches invocation would produce an error, and that error will be treated as false in the filter expression. Even if it doesn't return false for this reason, RFC 4647 §3.3.1, which describes how language tags and ranges are compared, doesn't say exactly what should happen in the comparison, since it's assuming legal language tags:
Basic filtering compares basic language ranges to language tags. Each
basic language range in the language priority list is considered in
turn, according to priority. A language range matches a particular
language tag if, in a case-insensitive comparison, it exactly equals
the tag, or if it exactly equals a prefix of the tag such that the
first character following the prefix is "-". For example, the
language-range "de-de" (German as used in Germany) matches the
language tag "de-DE-1996" (German as used in Germany, orthography of
1996), but not the language tags "de-Deva" (German as written in the
Devanagari script) or "de-Latn-DE" (German, Latin script, as used in
Germany).
Based on your comments and my local experiments, it appears that langMatches(lang(?obj), "") for literals without language tags (so really, langMatches("", "")) returns true in Virtuoso (as it's installed on DBpedia), in Jena's ARQ (from my experiments), and in Protégé (from your experiments), and returns false (or an error that's coerced to false) in RDFlib.
In either case, since lang is defined to return "" for literals without a language tag, you should be able to reliably include them in your results by replacing langMatches(lang(?obj), "") with lang(?obj) = "".
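For example, the rdflib query from the question with only the filter changed accordingly (a sketch; everything else stays as in the question):

import rdflib
import rdfextras

rdfextras.registerplugins()

g = rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")

query = """
PREFIX dbp: <http://dbpedia.org/resource/>
SELECT ?pred ?obj
WHERE {
  dbp:Johann_Sebastian_Bach ?pred ?obj
  FILTER(lang(?obj) = "" || langMatches(lang(?obj), "EN"))
}
"""
result_set = g.query(query)
# len(result_set) should now include the plain (untagged) literals as well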
Issues with the data that you're using
You're not querying the same data. The data that you download from http://dbpedia.org/resource/Johann_Sebastian_Bach is from DBpedia, but when you run a query against http://live.dbpedia.org/sparql, you're running it against DBpedia Live, which may have different data. If you run this query on the DBpedia Live endpoint and on the DBpedia endpoint, you get a different number of results:
SELECT count(*) WHERE {
  dbpedia:Johann_Sebastian_Bach ?pred ?obj
  FILTER( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}
DBpedia Live results: 31
DBpedia results: 34
Issues with distinct
Another possible problem, though it doesn't seem to be the one that you're running into, is that your second query has a distinct modifier, but your first one doesn't. That means that your second query could easily have fewer results than the first one.
If you run this query against the DBpedia SPARQL endpoint you should get 34 results, and that's the same whether or not you use the distinct modifiers, and it's the number that you should get if you download the data and run the same query against it.
select ?pred ?obj where {
  dbpedia:Johann_Sebastian_Bach ?pred ?obj
  filter( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}