Interpreting minimum_should_match for elasticsearch - python

I'm very new to elasticsearch query, I'm hoping to get some clarification with the following query of an existing source code that I'm looking at.
body["query"]["bool"]["should"] = [
{"match": {"categories": {"query": raw_query, "operator": "and"}}},
....,
....,
{"match": {"all": {"query": raw_query, "minimum_should_match": "50%" if keywords else "2<80%"}}}
]
body["query"]["bool"]["minimum_should_match"] = 1
My understanding of minimum_should_match is that it specifies the minimum number of words that must match the query. For example, in the following, any 2 of the 3 words young, transformation, and Egyptian count as a match for the description field.
"query":{
"match":{
"description":{
"query" : "young transformation Egyptian",
"minimum_should_match" : 2
}
}
}
From the source code I understand "minimum_should_match": "50%" to mean that, when there are keywords, at least half of the words in raw_query must match what is in the field all. What confuses me a bit is 2<80%. I've read the docs but I'm still confused.
The docs, https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html, give an example of 3<90% and say:
if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required.
What exactly is a clause? I would think a clause is each match statement, but in the source code this is placed within a single match clause. In that case, how can it ever have more than one clause? My understanding is obviously incorrect.
The last part I need confirmation on is:
body["query"]["bool"]["minimum_should_match"] = 1
Since it is placed outside of should, does that mean only a single match from body["query"]["bool"]["should"] is required?

My understanding of minimum_should_match is that it specifies the minimum number of words that must match the query.
Not always. Here minimum_should_match applies not to a specific full-text query but to the bool query, and it controls how many of the should clauses must match.
If minimum_should_match is applied to a specific match query then yes, it controls how many tokens (words) must match. At analysis time, each token in the query text becomes its own optional term clause, which is why a single match query can contain several clauses. Reading 2<80% with that in mind: if the query analyzes to 2 or fewer tokens, all of them are required; for more than 2 tokens, 80% of them (rounded down) are required.
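To make the two levels concrete, here is a minimal sketch of a request body in the spirit of the source code above (raw_query stands in for the query string; the field names are taken from the snippet and otherwise hypothetical):
body = {
    "query": {
        "bool": {
            "should": [
                # Clause 1: every analyzed token of raw_query must match.
                {"match": {"categories": {"query": raw_query, "operator": "and"}}},
                # Clause 2: token-level minimum_should_match. With "2<80%",
                # a query that analyzes to 2 or fewer tokens needs all of
                # them; a longer query needs 80% of its tokens.
                {"match": {"all": {"query": raw_query,
                                   "minimum_should_match": "2<80%"}}},
            ],
            # Clause-level minimum_should_match: at least 1 of the two
            # should clauses above must match for a document to be a hit.
            "minimum_should_match": 1,
        }
    }
}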


How to get the rows extracted if a few columns are blank or empty

I am new to this and need some help extracting the records/rows only when certain columns are blank. The code below ignores the blank records and returns only the ones with a value. Can someone suggest a fix?
mongo_docs = mongo.db.user.find({"$and":[{"Param1": {"$ne":None}}, {"Param1": {"$ne": ""}}]})
The query you are using contains $ne, which stands for not-equal. You can change that to $eq (equals) and check whether you get the desired results.
mongo_docs = mongo.db.user.find({"$or":[{"Param1": {"$eq":None}}, {"Param1": {"$eq": ""}}]})
A simplification of the above: note that a Python dict cannot hold two $eq keys for the same field (the second would silently overwrite the first), so use $in to test both values at once:
mongo_docs = mongo.db.user.find({"Param1": {"$in": [None, ""]}})
You can also use the exists query if that better satisfies your requirement.
From the comments: You are absolutely right, the command will work with multiple parameters.
mongo_docs = mongo.db.user.find({"$or": [{"Param1": {"$in": [None, ""]}}, {"Param2": {"$in": [None, ""]}}, {"Param3": {"$in": [None, ""]}}]})
Additionally, if you want to do it over a larger range, you can consider using text indexes. These allow you to search all the text fields at once, and the code for that should look something like this:
mongo.db.user.create_index([("Param1", "text"), ("Param2", "text"), ("Param3", "text")])
mongo.db.user.find({"$text": {"$search": ""}})
The above works for text content only. I have not come across an implementation for integer values yet, but I cannot see why the same should not work, with some minor changes, using wildcard indexes.
References:
Search multiple fields for multiple values in MongoDB
https://docs.mongodb.com/manual/text-search/
https://docs.mongodb.com/manual/core/index-text/#std-label-text-index-compound
https://docs.mongodb.com/manual/core/index-wildcard/

How to improve query accuracy of Elasticsearch from Python?

How can you improve the accuracy search results from Elasticsearch using the Python wrapper? My basic example returns results, but the results are very inaccurate.
I'm running Elasticsearch 5.2 on Ubuntu 16, and I start by creating my index and adding a few documents like:
from elasticsearch import Elasticsearch

es = Elasticsearch()
# Document A
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some specific keywords',
        weight=1.0,
        data='blah1',
    ),
)
# Document B
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other specific keywords',
        weight=1.0,
        data='blah2',
    ),
)
# Document C
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other very long text that is very different yet mentions the word specific and keywords',
        weight=1.0,
        data='blah3',
    ),
)
I then query it with:
es = Elasticsearch()
es.indices.create(index='my-test-index', ignore=400)
query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            "function_score": {
                "query": {
                    "match": {
                        "search_key": query
                    }
                },
                "functions": [{
                    "script_score": {
                        "script": "doc['weight'].value"
                    }
                }],
                "score_mode": "multiply"
            }
        },
    }
)
And although it returns all results, it returns them in the order of documents B, C, A, whereas I would expect them in the order A, B, C, because although all the documents contain all my keywords, only the first one is an exact match. I would expect C to be last because, even though it contains all my keywords, it also contains a lot of fluff I'm not explicitly searching for.
This problem compounds when I index more entries. The search returns everything that has even a single keyword from my query, and seemingly weights them all identically, causing the search results to get less and less accurate as my index grows.
This is making Elasticsearch almost useless. Is there any way I can fix it? Is there a problem with my search() call?
In your query, you can use a match_phrase query instead of a match query so that the order and proximity of the search terms factor into the score. Additionally, you can add a small slop to allow the terms to be further apart or in a different order; documents with terms in the same order and closer together will still rank higher than documents with terms out of order and/or further apart. Try it out:
"query": {
"match_phrase": {
"search_key": query,
"slop": 10
}
},
Note: slop is a number that indicates how many "swaps" of the search terms you need to perform in order to land on the term configuration present in the document.
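If you want to keep the weight-based scoring from the question, the match_phrase query can simply replace the match query inside function_score. A sketch, reusing the index, field, and script from the question:
from elasticsearch import Elasticsearch

es = Elasticsearch()
query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'function_score': {
                # Phrase matching instead of bag-of-words matching; slop
                # lets the terms sit up to 10 positions apart.
                'query': {
                    'match_phrase': {
                        'search_key': {
                            'query': query,
                            'slop': 10,
                        }
                    }
                },
                'functions': [{
                    'script_score': {'script': "doc['weight'].value"}
                }],
                'score_mode': 'multiply',
            }
        },
    },
)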
Sorry for not reading your question more carefully and for the loaded answer below. I don't want to be a stick in the mud, but I think it will be clearer if you understand a bit more how Elasticsearch itself works.
Because you index your documents without specifying any index and mapping configuration, Elasticsearch uses several defaults that it provides out of the box. The indexing process first tokenizes field values in your document using the standard tokenizer and analyzes them with the standard analyzer before storing them in the index. Both the standard tokenizer and analyzer work by splitting your string on word boundaries. So at the end of index time, what your index holds for the search_key field are the terms ["some", "specific", "keywords"], not "some specific keywords".
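You can see this tokenization for yourself with the analyze API. A quick sketch using the same Python client (the analyzer name and sample text are just for illustration):
from elasticsearch import Elasticsearch

es = Elasticsearch()
# Ask Elasticsearch how the standard analyzer splits the string.
response = es.indices.analyze(body={
    'analyzer': 'standard',
    'text': 'some specific keywords',
})
print([t['token'] for t in response['tokens']])
# ['some', 'specific', 'keywords']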
During search time, the match query controls relevance using a similarity algorithm called term frequency/inverse document frequency, or TF/IDF. This algorithm is very popular in text search in general and there is a Wikipedia article on it: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. What's important to note here is that the more frequently a term appears in the index, the less it contributes to relevance. some, specific, and keywords appear in ALL 3 documents in your index, so as far as Elasticsearch is concerned, they contribute very little to a document's relevance in your search result. Since A contains only these terms, it's like having a document containing only the, an, a in an English index: it won't show up as the first result even if you search for the, an, a specifically. B ranks higher than C because B is shorter, which yields a higher norm value. This norm value is explained in the relevance document. This is a bit of speculation on my part, but I think it does work out this way if you explain the query using the explain API.
So, back to your need: how to favor exact matches over everything else? There is, of course, the match_phrase query, as Val pointed out. Another popular method, which I personally prefer, is to index the raw value in a sub-field called search_key.raw using the not_analyzed option when defining your mapping (https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2) and simply boost this raw value when you search.
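A sketch of that approach, assuming the index can be recreated from scratch. Note that on Elasticsearch 5.x the equivalent of not_analyzed is a keyword sub-field:
from elasticsearch import Elasticsearch

es = Elasticsearch()
# Recreate the index with a raw (keyword) sub-field next to the
# analyzed text field.
es.indices.create(index='my-test-index', body={
    'mappings': {
        'text': {
            'properties': {
                'search_key': {
                    'type': 'text',
                    'fields': {'raw': {'type': 'keyword'}},
                },
            },
        },
    },
})

# Boost exact matches on the raw value over ordinary analyzed matches.
results = es.search(index='my-test-index', body={
    'query': {
        'bool': {
            'should': [
                {'match': {'search_key': 'some specific keywords'}},
                {'term': {'search_key.raw': {
                    'value': 'some specific keywords',
                    'boost': 10,
                }}},
            ],
        },
    },
})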

Find documents which contain a particular value - Mongo, Python

I'm trying to add a search option to my website but it doesn't work. I looked up solutions but they all refer to using an actual string, whereas in my case I'm using a variable, and I can't make those solutions work. Here is my code:
cursor = source.find({'title': search_term}).limit(25)
for document in cursor:
    result_list.append(document)
Unfortunately this only gives back results whose title matches the search_term variable's value exactly. I want it to give back any results where the title contains the search term, regardless of what else it contains. How can I do that when passing a variable rather than a string literal? Thanks.
You can use $regex to do contains searches.
cursor = collection.find({'field': {'$regex':'regular expression'}})
And to make it case insensitive:
cursor = collection.find({'field': {'$regex':'regular expression', '$options':'i'}})
Please try cursor = source.find({'title': {'$regex':search_term}}).limit(25)
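One caveat: if search_term can contain regex metacharacters (., *, parentheses, and so on), escape it so the query matches the literal substring. A small sketch, reusing the question's source collection and search_term variable:
import re

# Treat the user's input as a literal substring and match
# case-insensitively via the 'i' option.
cursor = source.find({
    'title': {'$regex': re.escape(search_term), '$options': 'i'}
}).limit(25)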
$text
You can perform a text search using $text & $search. You first need to create a text index, then use it:
db.docs.createIndex( { title: "text" } )
db.docs.find( { $text: { $search: "search_term" } } )
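Since the question passes a variable rather than a literal, the same search from PyMongo might look like this (a sketch; source is the collection from the question, and the index only needs to be created once):
# One-time setup: a text index on the title field.
source.create_index([('title', 'text')])

# $search takes an ordinary string, so the variable can be passed directly.
cursor = source.find({'$text': {'$search': search_term}}).limit(25)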
$regex
You may also use $regex, as answered here: https://stackoverflow.com/a/10616781/641627
db.users.findOne({"username" : {$regex : ".*son.*"}});
Both solutions compared
Full Text Search vs. Regular Expressions
... The regular expression search takes longer for queries with just a
few results while the full text search gets faster and is clearly
superior in those cases.

mongodb findOne and $or: does order of arguments matter, or hierarchy? [performance]

I mean, for example, when there are two conditions: if the first condition is true, will it avoid checking the second?
doc = collection.find_one(
    {'$or': [
        {
            'k': kind,
            'i': int(pk)
        },
        {
            'children.k': kind,
            'children.i': int(pk)
        }
    ]},
    {'_id': False})
I would like it to stop searching as soon as the first condition matches, so that it does not go a level deeper and search around children.
Is it a matter of argument order inside the $or clause, or does MongoDB smartly know the hierarchy, and does that influence the order in which findOne searches?
Yes, the order matters, which is one strong reason the arguments take an array form, which of course is ordered.
This is basically referred to as "short circuit" evaluation: only where the first condition does not match is the next condition tested, and so on.
This can best be demonstrated with a collection like this:
{ "a": 1 },
{ "a": 2, "b": 1 }
And then the following query:
db.collection.find({ "$or": [ { "a": 1 }, { "b": 1 } ] })
This of course finds both documents: even though the first document has no element for "b", the first condition is met anyway. In the second document, since the first condition failed, the second was used to match.
I would like it to stop searching as soon as the first condition matches, so that it does not go a level deeper and search around children.
The question you have to ask yourself is: how can MongoDB know that one side of the $or satisfies the whole expression for every document? How does MongoDB know that the documents that do not satisfy the first condition do not satisfy the second?
If I were to say that I have a set of documents, half with {a:1,b:1} and half with {b:2}, how can you know that a:1 OR b:1 is satisfied by the first half if you have no idea what the second half looks like?
The simple answer is that it doesn't. It has to search both conditions (via parallel queries whose results are returned and de-duplicated), so order does not really matter. Order would matter for an $and, and even then its importance lies in the index rather than the query, since the query will be rearranged to optimise for the quickest path to results.
So in reality the way MongoDB works is that it shoots off a "query" per condition. This actually explains the behaviour: http://docs.mongodb.org/manual/reference/operator/query/or/#behaviors
When using indexes with $or queries, each clause of an $or can use its own index.
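You can check this from PyMongo with explain(). A sketch, assuming one index per $or branch and reusing the collection, kind, and pk from the question:
# Give each $or branch an index of its own, then inspect the plan.
collection.create_index([('k', 1), ('i', 1)])
collection.create_index([('children.k', 1), ('children.i', 1)])

plan = collection.find({'$or': [
    {'k': kind, 'i': int(pk)},
    {'children.k': kind, 'children.i': int(pk)},
]}).explain()

# The winning plan should show one index scan per $or clause
# rather than a single collection scan.
print(plan['queryPlanner']['winningPlan'])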

The same SPARQL query gives different results

I read some questions related to mine, like
Same sparql not returning same results, but I think this one is a little different.
Consider this query, which I submit to http://live.dbpedia.org/sparql (a Virtuoso endpoint) and get 34 triples as a result.
SELECT ?pred ?obj
WHERE {
<http://dbpedia.org/resource/Johann_Sebastian_Bach> ?pred ?obj
FILTER((langMatches(lang(?obj), "")) ||
(langMatches(lang(?obj), "EN"))
)
}
Then, I used the same query in a code in python:
import rdflib
import rdfextras
rdfextras.registerplugins()
g=rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")
PREFIX = """
PREFIX dbp: <http://dbpedia.org/resource/>
"""
query = """
SELECT ?pred ?obj
WHERE {dbp:Johann_Sebastian_Bach ?pred ?obj
FILTER( (langMatches(lang(?obj), "")) ||
(langMatches(lang(?obj), "EN")))}
"""
query = PREFIX + query
result_set = g.query(query)
print len(result_set)
This time, I get only 27 triples! https://dl.dropboxusercontent.com/u/22943656/result.txt
I thought it could be related to the DBpedia site, so I repeated these queries several times and always got the same difference. I therefore downloaded the RDF file to test it locally and used the software Protégé to simulate the Virtuoso endpoint. Even so, I still get different results from the SPARQL submitted in Protégé and in Python: 31 and 27. Is there any explanation for this difference? And how can I get the same result in both?
As the question is written, there are a few possible problems. Based on the comments, the first one described here (about lang, langMatches, etc.) seems to be what you're actually running into, but I'll leave the descriptions of the other possible problems, in case someone else finds them useful.
lang, langMatches, and the empty string
lang is defined to return "" for literals with no language tags. According to RFC 4647 §2.1, basic language ranges are defined as follows:
2.1. Basic Language Range
A "basic language range" has the same syntax as an [RFC3066]
language tag or is the single character "*". The basic language
range was originally described by HTTP/1.1 [RFC2616] and later
[RFC3066]. It is defined by the following ABNF [RFC4234]:
language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
alphanum = ALPHA / DIGIT
This means that "" isn't actually a legal language tag. As Jeen Broekstra pointed out on answers.semanticweb.com, the SPARQL recommendation says:
17.2 Filter Evaluation
SPARQL provides a subset of the functions and operators defined by
XQuery Operator Mapping. XQuery 1.0 section 2.2.3 Expression
Processing describes the invocation of XPath functions. The following
rules accommodate the differences in the data and execution models
between XQuery and SPARQL: …
Functions invoked with an
argument of the wrong type will produce a type error. Effective
boolean value arguments (labeled "xsd:boolean (EBV)" in the operator
mapping table below), are coerced to xsd:boolean using the EBV rules
in section 17.2.2.
Since "" isn't a legal language tag, it might be considered "an argument of the wrong type [that] will produce a type error." In that case, the langMatches invocation would produce an error, and that error will be treated as false in the filter expression. Even if it doesn't return false for this reason, RFC 4647 §3.3.1, which describes how language tags and ranges are compared, doesn't say exactly what should happen in the comparison, since it's assuming legal language tags:
Basic filtering compares basic language ranges to language tags. Each
basic language range in the language priority list is considered in
turn, according to priority. A language range matches a particular
language tag if, in a case-insensitive comparison, it exactly equals
the tag, or if it exactly equals a prefix of the tag such that the
first character following the prefix is "-". For example, the
language-range "de-de" (German as used in Germany) matches the
language tag "de-DE-1996" (German as used in Germany, orthography of
1996), but not the language tags "de-Deva" (German as written in the
Devanagari script) or "de-Latn-DE" (German, Latin script, as used in
Germany).
Based on your comments and my local experiments, it appears that langMatches(lang(?obj), "") for literals without language tags (so really, langMatches("", "")) returns true in Virtuoso (as it's installed on DBpedia), Jena's ARQ (from my experiments), and Protégé (from our experiments), and it returns false (or an error that's coerced to false) in RDFlib.
In either case, since lang is defined to return "" for literals without a language tag, you should be able to reliably include them in your results by replacing langMatches(lang(?obj), "") with lang(?obj) = "".
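Applied to the Python code from the question, that fix might look like this (same rdflib setup; only the filter changes):
query = """
PREFIX dbp: <http://dbpedia.org/resource/>
SELECT ?pred ?obj
WHERE {
  dbp:Johann_Sebastian_Bach ?pred ?obj
  FILTER( lang(?obj) = "" || langMatches(lang(?obj), "EN") )
}
"""
result_set = g.query(query)
print len(result_set)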
Issues with the data that you're using
You're not querying the same data. The data that you download from http://dbpedia.org/resource/Johann_Sebastian_Bach is from DBpedia, but when you run a query against http://live.dbpedia.org/sparql, you're running it against DBpedia Live, which may have different data. If you run this query on the DBpedia Live endpoint and on the DBpedia endpoint, you get a different number of results:
SELECT count(*) WHERE {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
FILTER( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN" ) )
}
DBpedia Live results: 31
DBpedia results: 34
Issues with distinct
Another possible problem, though it doesn't seem to be the one that you're running into, is that your second query has a distinct modifier, but your first one doesn't. That means that your second query could easily have fewer results than the first one.
If you run this query against the DBpedia SPARQL endpoint you should get 34 results, and that's the same whether or not you use the distinct modifiers, and it's the number that you should get if you download the data and run the same query against it.
select ?pred ?obj where {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
filter( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}
