Filter using Range-Elasticsearch - python

The following is the Kibana JSON of a single document:
{
"_index": "questionanswers",
"_type": "doc",
"_id": "3",
"_version": 1,
"_score": 0,
"_source": {
"question": {
"id": 3,
"text": "Your first salary",
"answer_type": "FL",
"question_type": "BQ"
},
"candidate": {
"id": 13
},
"job": {
"id": 6
},
"id": 3,
"status": "AN",
"answered_on": "2019-07-12T09:26:01+00:00",
"answer": "12222222"
},
"fields": {
"answered_on": [
"2019-07-12T09:26:01.000Z"
]
}
}
I have an SQL query like this:
Select * from questionanswers where question.id = 3 and answer between 1250 and 1253666
I have converted it to the following Elasticsearch query:
{
"size": 1000,
"query": {
"bool": {
"must": [
{
"term": {
"question.id":3
}
},
{
"range": {
"answer": {
"from": 1250,
"to": 1253666999,
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
Here answer is declared as a String, but it holds Date, Float and String values.
"question": {
"id": 3,
"text": "Your first salary",
"answer_type": "FL",
"question_type": "BQ"
},
Here answer_type tells which type of answer is expected.
When I run this query I do not get the desired results: the response contains no hits,
although there is a document that satisfies the query.
How should my Elasticsearch query look so that I can filter with
question.id = 3, question.answer_type = "FL" and answer between 1250 and 1253666?

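Look at your document again. The answer is stored as a string value, but your query treats it as a number. A range query on a string field compares values as text rather than numerically, so the document is not matched.
Change the mapping of this field to a numeric type. The type of an existing field cannot be changed in place, so this means creating a new index with the new mapping and reindexing.
Here is the document I indexed in a test index; running your query against it works.
Indexing the document (note that the answer field is now a number):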
POST /so-index4/_doc/1
{
"question": {
"id": 3,
"text": "Your first salary",
"answer_type": "FL",
"question_type": "BQ"
},
"candidate": {
"id": 13
},
"job": {
"id": 6
},
"id": 3,
"status": "AN",
"answered_on": "2019-07-12T09:26:01+00:00",
"answer": 12222222,
"fields": {
"answered_on": [
"2019-07-12T09:26:01.000Z"
]
}
}
And the query (the same query that you provided above):
GET /so-index4/_search
{
"size": 1000,
"query": {
"bool": {
"must": [
{
"term": {
"question.id":3
}
},
{
"range": {
"answer": {
"from": 1250,
"to": 1253666999,
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
The result:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 2.0,
"hits" : [
{
"_index" : "so-index4",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.0,
"_source" : {
"question" : {
"id" : 3,
"text" : "Your first salary",
"answer_type" : "FL",
"question_type" : "BQ"
},
"candidate" : {
"id" : 13
},
"job" : {
"id" : 6
},
"id" : 3,
"status" : "AN",
"answered_on" : "2019-07-12T09:26:01+00:00",
"answer" : 12222222,
"fields" : {
"answered_on" : [
"2019-07-12T09:26:01.000Z"
]
}
}
}
]
}
}
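Why the string mapping yields no hits can be illustrated outside Elasticsearch. The sketch below is plain Python, not Elasticsearch itself, and it assumes that a range over a string field orders terms lexicographically, character by character. The two helper functions are made up for this illustration.

```python
def in_range_as_text(value: str, low: str, high: str) -> bool:
    """Inclusive range check using string (lexicographic) ordering."""
    return low <= value <= high


def in_range_as_number(value: str, low: float, high: float) -> bool:
    """Inclusive range check after parsing the stored value as a number."""
    return low <= float(value) <= high


answer = "12222222"  # the value stored in the question's document

# Compared as text, "12222222" sorts before "1250" because '2' < '5'
# at the third character, so the document falls outside the range.
text_hit = in_range_as_text(answer, "1250", "1253666999")

# Compared as a number, 12222222 lies between 1250 and 1253666999.
numeric_hit = in_range_as_number(answer, 1250, 1253666999)

print(text_hit, numeric_hit)
```

Since answer also holds dates and free text in this index, a single numeric mapping may reject those other documents; storing numeric answers in a dedicated numeric field is one option to consider.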


ElasticSearch: Retrieve field and its normalization

I want to retrieve a field as well as its normalized version from Elasticsearch.
Here are my index definition and data:
PUT normalizersample
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"refresh_interval": "60s",
"analysis": {
"normalizer": {
"my_normalizer": {
"filter": [
"lowercase",
"german_normalization",
"asciifolding"
],
"type": "custom"
}
}
}
},
"mappings": {
"_source": {
"enabled": true
},
"properties": {
"myField": {
"type": "text",
"store": true,
"fields": {
"keyword": {
"type": "keyword",
"store": true
},
"normalized": {
"type": "keyword",
"store": true,
"normalizer": "my_normalizer"
}
}
}
}
}
}
POST normalizersample/_doc/1
{
"myField": ["Andreas", "Ämdreas", "Anders"]
}
My first approach was to use script fields like this:
GET /myIndex/_search
{
"size": 100,
"query": {
"match_all": {}
},
"script_fields": {
"keyword": {
"script": "doc['myField.keyword']"
},
"normalized": {
"script": "doc['myField.normalized']"
}
}
}
However, since myField is an array, this returns two lists of strings per ES document, and each list is sorted alphabetically. Because normalization changes the sort order, the entries at the same position in the two lists do not necessarily correspond to each other.
"hits" : [
{
"_index" : "normalizersample",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"de" : [
"amdreas",
"anders",
"andreas"
],
"keyword" : [
"Anders",
"Andreas",
"Ämdreas"
]
}
}
]
Instead, I would like to retrieve [(Andreas, andreas), (Ämdreas, amdreas), (Anders, anders)] or a similar format in which I can match every entry to its normalization.
The only way I found was to call Term Vectors on both fields, since they contain a position field, but this seems like a huge overhead to me. (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html)
Is there a simpler way to retrieve tuples of the keyword value and the normalized value?
Thanks a lot!
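One possible direction, offered as a sketch rather than a confirmed answer: _source keeps the array in its original order, so each value can be paired with a normalization computed on the client. The approx_normalize helper below is hypothetical and only approximates my_normalizer (lowercasing plus stripping diacritics); it does not reproduce every rule of Elasticsearch's german_normalization filter.

```python
import unicodedata


def approx_normalize(value: str) -> str:
    """Rough client-side stand-in for my_normalizer (an approximation)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def pair_with_normalized(values):
    """Keep the original order and attach the normalized form to each entry."""
    return [(value, approx_normalize(value)) for value in values]


# The array as it appears in the document's _source.
my_field = ["Andreas", "Ämdreas", "Anders"]
pairs = pair_with_normalized(my_field)
print(pairs)
```

If the exact server-side normalization matters, the pairs would have to come from Elasticsearch itself instead of this approximation.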

Generating a dynamic nested JSON using for loop in python

I am a newbie in Python and I am having difficulty generating a nested JSON using a for loop. The length of the dictionary is only known at runtime, and I want to generate the nested JSON based on it. For example, the dictionary below has a length of 4, but the length may vary. Here is my data_dict dictionary:
data_dict = {"PHOTO_1" : {"key1" : "PHOTO_2", "key2" : "PHOTO_3", "key3" : "PHOTO_4"}, "PHOTO_2" : {"key1" : "PHOTO_1", "key2" : "PHOTO_3"},"PHOTO_3" : {"key1" : "PHOTO_2"},"PHOTO_4" : {"key1" : "PHOTO_1", "key2" : "PHOTO_2", "key3" : "PHOTO_3"}}
Expected result :
{
"Requests": [
{
"photo": {
"photoId": {
"id": "PHOTO_1"
},
"connections": {
"target": {
"id": "PHOTO_2"
}
}
},
"updateData": "connections"
},
{
"photo": {
"photoId": {
"id": "PHOTO_1"
},
"connections": {
"target": {
"id": "PHOTO_3"
}
}
},
"updateData": "connections"
},
{
"photo": {
"photoId": {
"id": "PHOTO_1"
},
"connections": {
"target": {
"id": "PHOTO_4"
}
}
},
"updateData": "connections"
},
{
"photo": {
"photoId": {
"id": "PHOTO_2"
},
"connections": {
"target": {
"id": "PHOTO_1"
}
}
},
"updateData": "connections"
},
{
"photo": {
"photoId": {
"id": "PHOTO_2"
},
"connections": {
"target": {
"id": "PHOTO_3"
}
}
},
"updateData": "connections"
},
{
"photo": {
"photoId": {
"id": "PHOTO_3"
},
"connections": {
"target": {
"id": "PHOTO_2"
}
}
},
"updateData": "connections"
},
{
"photo": {
"photoId": {
"id": "PHOTO_4"
},
"connections": {
"target": {
"id": "PHOTO_1"
}
}
},
"updateData": "connections"
},
{
"photo": {
"photoId": {
"id": "PHOTO_4"
},
"connections": {
"target": {
"id": "PHOTO_2"
}
}
},
"updateData": "connections"
},
{
"photo": {
"photoId": {
"id": "PHOTO_4"
},
"connections": {
"target": {
"id": "PHOTO_3"
}
}
},
"updateData": "connections"
}
]
}
Please help. I don't see how to solve this. Please don't mark it as a duplicate; I have already checked the other answers and my JSON structure is quite different.
A solution using the itertools.permutations() function:
import itertools, json
data_dict = {"first_photo" : "PHOTO_1", "second_photo" : "PHOTO_2", "Thrid" : "PHOTO_3"}
result = {"Requests":[]}
for pair in sorted(itertools.permutations(data_dict.values(), 2)):
    result["Requests"].append({"photo":{"photoId":{"id": pair[0]},
        "connections":{"target":{"id": pair[1]}}},"updateData": "connections"})
print(json.dumps(result, indent=4))
An additional approach for the new (nested) input dict:
data_dict = {"PHOTO_1" : {"key1" : "PHOTO_2", "key2" : "PHOTO_3", "key3" : "PHOTO_4"}, "PHOTO_2" : {"key1" : "PHOTO_1", "key2" : "PHOTO_3"},"PHOTO_3" : {"key1" : "PHOTO_2"},"PHOTO_4" : {"key1" : "PHOTO_1", "key2" : "PHOTO_2", "key3" : "PHOTO_3"}}
result = {"Requests":[]}
for k,d in sorted(data_dict.items()):
    for v in sorted(d.values()):
        result["Requests"].append({"photo":{"photoId":{"id": k},
            "connections":{"target":{"id": v}}},"updateData": "connections"})
print(json.dumps(result, indent=4))

Nothing happens when trying to use $project

I am new to MongoDB and still stuck on the same pipeline. I don't understand why my use of $project did not generate any output at all.
def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {
            '$match': {
                "user.statuses_count": {"$gt": 99},
                "user.time_zone": "Brasilia"
            }
        },
        {
            "$group": {
                "_id": "$user.id",
                "followers": {"$max": "$user.followers_count"}
            }
        },
        {
            "$sort": {"followers": -1}
        },
        {
            "$project": {
                "userId": "$user.id",
                "screen_name": "$user.screen_name",
                "retweet_count": "$retweet_count"
            }
        },
        {
            "$limit": 1
        }
    ]
    return pipeline
Any ideas?
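After the $group stage each document contains only the fields defined in that stage (_id and followers), so the $project stage has no user.id, user.screen_name or retweet_count left to project. Try the aggregation pipeline below, which carries the original document through the $group stage; it should give you the desired output.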
Using Mongo shell:
Test documents (with minimum test case):
db.tweet.insert([
{
"retweet_count" : 23,
"user" : {
"time_zone" : "Brasilia",
"statuses_count" : 2475,
"screen_name" : "Catherinemull",
"followers_count" : 169,
"id" : 37486277
},
"id" : NumberLong("22819398300")
},
{
"retweet_count" : 7,
"user" : {
"time_zone" : "Lisbon",
"statuses_count" : 4532,
"screen_name" : "foo",
"followers_count" : 43,
"id" : 37486278
},
"id" : NumberLong("22819398301")
},
{
"retweet_count" : 12,
"user" : {
"time_zone" : "Brasilia",
"statuses_count" : 132,
"screen_name" : "test2",
"followers_count" : 4,
"id" : 37486279
},
"id" : NumberLong("22819398323")
},
{
"retweet_count" : 4235,
"user" : {
"time_zone" : "Brasilia",
"statuses_count" : 33,
"screen_name" : "test4",
"followers_count" : 2,
"id" : 37486280
},
"id" : NumberLong("22819398308")
},
{
"retweet_count" : 562,
"user" : {
"time_zone" : "Kenya",
"statuses_count" : 672,
"screen_name" : "Kiptot",
"followers_count" : 169,
"id" : 37486281
},
"id" : NumberLong("22819398374")
},
{
"retweet_count" : 789,
"user" : {
"time_zone" : "Brasilia",
"statuses_count" : 5263,
"screen_name" : "test231",
"followers_count" : 8282,
"id" : 37486
},
"id" : NumberLong("22819398331")
}
]);
The Magic:
db.tweet.aggregate([
{
'$match': {
"user.statuses_count": {"$gt":99 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"followers": { "$max": "$user.followers_count" },
"doc": {
"$addToSet": "$$ROOT"
}
}
},
{
"$sort": { "followers": -1 }
},
{
"$unwind": "$doc"
},
{
"$project": {
"_id": 0,
"userId": "$_id",
"screen_name": "$doc.user.screen_name",
"retweet_count": "$doc.retweet_count",
"followers": 1
}
},
{
"$limit": 1
}
]);
Output:
/* 1 */
{
"result" : [
{
"userId" : 37486,
"screen_name" : "test231",
"retweet_count" : 789,
"followers" : 8282
}
],
"ok" : 1
}
-- UPDATE --
Python implementation:
>>> from bson.son import SON
>>> pipeline = [
... {"$match": {"user.statuses_count": {"$gt": 99}, "user.time_zone": "Brasilia"}},
... {"$group": {"_id": "$user.id", "followers": { "$max": "$user.followers_count" }, "doc": {"$addToSet": "$$ROOT"}}},
... {"$sort": {"followers": -1 }},
... {"$unwind": "$doc"}, {"$project": {"_id": 0, "userId": "$_id", "screen_name": "$doc.user.screen_name", "retweet_count": "$doc.retweet_count", "followers": 1}},
... {"$limit": 1}
... ]
>>> list(db.tweet.aggregate(pipeline))
[{u'userId': 37486, u'screen_name': u'test231', u'retweet_count': 789, u'followers': 8282}]
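Why the original $project came back empty can be shown with a plain-Python imitation of the two stages. This is only a sketch of the behaviour, not MongoDB code: resolve, group_max_followers and project are made-up helpers, and the two sample tweets are trimmed copies of the test documents above.

```python
MISSING = object()  # marks a field path that does not exist in a document


def resolve(doc, path):
    """Follow a dotted path such as 'user.id'; return MISSING if absent."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return MISSING
        current = current[key]
    return current


def group_max_followers(docs):
    """Imitate $group with _id '$user.id' and followers as a $max."""
    best = {}
    for doc in docs:
        uid = doc["user"]["id"]
        count = doc["user"]["followers_count"]
        best[uid] = max(best.get(uid, count), count)
    return [{"_id": uid, "followers": count} for uid, count in best.items()]


def project(docs, spec):
    """Imitate $project: keep _id, copy each '$path' that resolves, skip the rest."""
    out = []
    for doc in docs:
        projected = {"_id": doc["_id"]}
        for name, path in spec.items():
            value = resolve(doc, path[1:])  # drop the leading '$'
            if value is not MISSING:
                projected[name] = value
        out.append(projected)
    return out


tweets = [
    {"retweet_count": 23, "user": {"id": 37486277, "screen_name": "Catherinemull",
                                   "followers_count": 169}},
    {"retweet_count": 789, "user": {"id": 37486, "screen_name": "test231",
                                    "followers_count": 8282}},
]

grouped = group_max_followers(tweets)

# The projection from the question: none of these paths survive $group.
original = project(grouped, {"userId": "$user.id",
                             "screen_name": "$user.screen_name",
                             "retweet_count": "$retweet_count"})

# Projecting fields that $group actually produced works as expected.
fixed = project(grouped, {"userId": "$_id", "followers": "$followers"})

print(original)
print(fixed)
```

In this imitation the original projection leaves only _id in each document, which is why the answer's pipeline keeps the source document in a doc field before projecting from it.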

Returning an array in mongodb

I'm trying to return certain elements of an array in the document below.
{
"_id": 2,
"awardAmount": 6000,
"url": "www.url.com",
"numAwards": 3,
"award": "Faculty Seed Research Grant",
"Type": "faculty",
"Applicates": [
{
"School": "psu",
"Name": "tom",
"URL": "www.url.com",
"Time": "",
"Research": "",
"Budge": 7500,
"appId": 100,
"citizenship": "us",
"Major": "mat",
"preAwards": "None",
"Advisor": ""
},
{
"School": "ffff",
"Name": "KEVIN",
"URL": "www.url.com",
"Time": "5/5/5-6/6/6",
"Research": "topology",
"Budge": 9850,
"appId": 101,
"citizenship": "us",
"Major": "gym",
"preAwards": "None",
"Advisor": "Dr. cool",
"Evaluators": [
{
"abstractScore": 3,
"goalsObjectivesScore": 4,
"evalNum": 1
},
{
"abstractScore": 545646,
"goalsObjectivesScore": 46546,
"evalNum": 2
}
]
}
]
}
I want only the "Applicates" data if they have an "Evaluators" field. Here is what I was trying
db.coll.find({'Applicates.Evaluators':{'$exists': True }})
This gives me the whole document but I just want "Applicates" data that have the "Evaluators" field in it like this.
{
"_id": 2,
"awardAmount": 6000,
"url": "www.url.com",
"numAwards": 3,
"award": "Faculty Seed Research Grant",
"Type": "faculty",
"Applicates": [
{
"School": "ffff",
"Name": "KEVIN",
"URL": "www.url.com",
"Time": "5/5/5-6/6/6",
"Research": "topology",
"Budge": 9850,
"appId": 101,
"citizenship": "us",
"Major": "gym",
"preAwards": "None",
"Advisor": "Dr. cool",
"Evaluators": [
{
"abstractScore": 3,
"goalsObjectivesScore": 4,
"evalNum": 1
},
{
"abstractScore": 545646,
"goalsObjectivesScore": 46546,
"evalNum": 2
}
]
}
]
}
Try this (the key is using the $unwind operator):
db.coll.aggregate(
[
{ $match : {'Applicates.Evaluators':{'$exists': true }} },
{ $unwind : "$Applicates" },
{ $match : {'Applicates.Evaluators':{'$exists': true }} },
{ $group : { _id : "$_id",
'Applicates' : {$push : '$Applicates'} ,
awardAmount : {$first : '$awardAmount'},
url : {$first : '$url'},
award : {$first : '$award'},
numAwards : {$first : '$numAwards'},
Type : {$first : '$Type'},
}},
])
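The effect of the $unwind, $match and $group stages can be pictured with a plain-Python equivalent. This is only an illustration (or a client-side fallback when the document has already been fetched), not part of the pipeline; keep_evaluated is a made-up helper and the sample document is trimmed to a few fields.

```python
import copy


def keep_evaluated(doc):
    """Return a copy of doc whose Applicates all carry an Evaluators field."""
    result = copy.copy(doc)  # shallow copy, so the input's list is left untouched
    result["Applicates"] = [
        applicant
        for applicant in doc.get("Applicates", [])
        if "Evaluators" in applicant
    ]
    return result


award = {
    "_id": 2,
    "award": "Faculty Seed Research Grant",
    "Applicates": [
        {"Name": "tom", "appId": 100},
        {"Name": "KEVIN", "appId": 101,
         "Evaluators": [{"abstractScore": 3, "evalNum": 1}]},
    ],
}

filtered = keep_evaluated(award)
names = [applicant["Name"] for applicant in filtered["Applicates"]]
print(names)
```

Doing the filtering in the aggregation pipeline, as in the answer, avoids transferring the unwanted array elements at all.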

Elasticsearch full-text autocomplete

I'm using Elasticsearch through the python requests library. I've set up my analysers like so:
"analysis" : {
"analyzer": {
"my_basic_search": {
"type": "standard",
"stopwords": []
},
"my_autocomplete": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase", "autocomplete"]
}
},
"filter": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20,
}
}
}
I've got a list of artists who I'd like to search for using autocomplete: my current test case is 'bill w', which should match 'bill withers' etc - the artist mapping looks like this (this is a output of GET http://localhost:9200/my_index/artist/_mapping):
{
"my_index" : {
"mappings" : {
"artist" : {
"properties" : {
"clean_artist_name" : {
"type" : "string",
"analyzer" : "my_basic_search",
"fields" : {
"autocomplete" : {
"type" : "string",
"index_analyzer" : "my_autocomplete",
"search_analyzer" : "my_basic_search"
}
}
},
"submitted_date" : {
"type" : "date",
"format" : "basic_date_time"
},
"total_count" : {
"type" : "integer"
}
}
}
}
}
}
...and then I run this query to do the autocomplete:
"query": {
"function_score": {
"query": {
"bool": {
"must" : { "match": { "clean_artist_name.autocomplete": "bill w" } },
"should" : { "match": { "clean_artist_name": "bill w" } },
}
},
"functions": [
{
"script_score": {
"script": "artist-score"
}
}
]
}
}
This seems to match artists that contain either 'bill' or 'w' as well as 'bill withers': I only wanted to match artists that contain that exact string. The analyser seems to be working fine, here is the output of http://localhost:9200/my_index/_analyze?analyzer=my_autocomplete&text=bill%20w:
{
"tokens" : [ {
"token" : "b",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bi",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bil",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bill",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bill ",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bill w",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
} ]
}
So why is this not excluding matches with just 'bill' or 'w' in there? Is there something in my query that is allowing the results that only match with the my_basic_search analyser?
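I believe you need a "term" query instead of a "match" query for your "must" clause. A "match" query analyzes the search text with the field's search analyzer (my_basic_search here), which splits "bill w" into the tokens "bill" and "w" and matches documents containing either of them; both tokens are also edge n-grams of other artist names. You have already split your artist names into n-grams at index time, so the search text should match exactly one of those n-grams. A "term" query does not analyze its input and therefore matches the n-gram "bill w" exactly (the input must already be lowercased, because the indexed n-grams are):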
"query": {
"function_score": {
"query": {
"bool": {
"must" : { "term": { "clean_artist_name.autocomplete": "bill w" } },
"should" : { "match": { "clean_artist_name": "bill w" } },
}
},
"functions": [
{
"script_score": {
"script": "artist-score"
}
}
]
}
}
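The difference between the two queries can be imitated in plain Python. This is a simplified sketch, not Elasticsearch code: edge_ngrams stands in for the my_autocomplete analyzer (keyword tokenizer, lowercase, edge n-grams), match_hits assumes the search analyzer splits on whitespace and that any single token may match, and the artist names other than Bill Withers are invented for the example.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Edge n-grams of the whole lowercased string (keyword tokenizer + lowercase)."""
    text = text.lower()
    return {text[:n] for n in range(min_gram, min(len(text), max_gram) + 1)}


def match_hits(indexed_terms, search_text):
    """Imitate match: split the search text into words; ANY indexed word is a hit."""
    return any(word in indexed_terms for word in search_text.lower().split())


def term_hits(indexed_terms, search_text):
    """Imitate term: the search text is used as-is and must equal an indexed term."""
    return search_text in indexed_terms


artists = ["Bill Withers", "Bill Bailey", "Will Smith"]
index = {name: edge_ngrams(name) for name in artists}

# "bill" is an n-gram of both Bills and "w" is an n-gram of Will Smith.
match_results = [name for name in artists if match_hits(index[name], "bill w")]

# Only "bill withers" has the exact n-gram "bill w".
term_results = [name for name in artists if term_hits(index[name], "bill w")]

print(match_results)
print(term_results)
```

Because the imitated term lookup does no lowercasing, a search for "Bill W" would find nothing here, which mirrors the need to lowercase the input before sending a term query.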
