Elasticsearch full-text autocomplete - python

I'm using Elasticsearch through the python requests library. I've set up my analysers like so:
"analysis" : {
"analyzer": {
"my_basic_search": {
"type": "standard",
"stopwords": []
},
"my_autocomplete": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase", "autocomplete"]
}
},
"filter": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20,
}
}
}
I've got a list of artists that I'd like to search using autocomplete. My current test case is 'bill w', which should match 'bill withers' etc. The artist mapping looks like this (this is the output of GET http://localhost:9200/my_index/artist/_mapping):
{
"my_index" : {
"mappings" : {
"artist" : {
"properties" : {
"clean_artist_name" : {
"type" : "string",
"analyzer" : "my_basic_search",
"fields" : {
"autocomplete" : {
"type" : "string",
"index_analyzer" : "my_autocomplete",
"search_analyzer" : "my_basic_search"
}
}
},
"submitted_date" : {
"type" : "date",
"format" : "basic_date_time"
},
"total_count" : {
"type" : "integer"
}
}
}
}
}
}
...and then I run this query to do the autocomplete:
"query": {
"function_score": {
"query": {
"bool": {
"must" : { "match": { "clean_artist_name.autocomplete": "bill w" } },
"should" : { "match": { "clean_artist_name": "bill w" } },
}
},
"functions": [
{
"script_score": {
"script": "artist-score"
}
}
]
}
}
This seems to match artists that contain either 'bill' or 'w' as well as 'bill withers': I only wanted to match artists that contain that exact string. The analyser seems to be working fine, here is the output of http://localhost:9200/my_index/_analyze?analyzer=my_autocomplete&text=bill%20w:
{
"tokens" : [ {
"token" : "b",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bi",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bil",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bill",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bill ",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "bill w",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
} ]
}
So why is this not excluding matches with just 'bill' or 'w' in there? Is there something in my query that is allowing the results that only match with the my_basic_search analyser?

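I believe you need a "term" query instead of a "match" query for your "must" clause. A "match" query runs the search text through the field's search_analyzer (my_basic_search here), which splits 'bill w' into the tokens 'bill' and 'w' and, by default, accepts a document that matches either one. Both tokens are valid edge n-grams, so every artist whose name starts with 'bill' or with 'w' comes back. A "term" query is not analysed, so the whole string 'bill w' has to equal one of the indexed n-grams. Nothing lowercases the input for you in that case, so lowercase it yourself before searching: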
"query": {
"function_score": {
"query": {
"bool": {
"must" : { "term": { "clean_artist_name.autocomplete": "bill w" } },
"should" : { "match": { "clean_artist_name": "bill w" } },
}
},
"functions": [
{
"script_score": {
"script": "artist-score"
}
}
]
}
}
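To make the difference between the two query types concrete, here is a minimal pure-Python sketch that imitates the analysis chain outside Elasticsearch. The helper names (edge_ngrams, match_hits, term_hits, build_autocomplete_query) and the sample artist list are made up for illustration, and splitting on whitespace is only a rough stand-in for the standard analyser.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Imitate my_autocomplete: keyword tokenizer + lowercase + edge_ngram."""
    text = text.lower()
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}


def match_hits(query, name):
    """Rough model of `match`: the query is split into words, any word may hit."""
    indexed = edge_ngrams(name)
    return any(word in indexed for word in query.lower().split())


def term_hits(query, name):
    """Model of `term`: the query string is looked up as-is, with no analysis."""
    return query in edge_ngrams(name)


def build_autocomplete_query(text):
    """Build the bool part of the query above; lowercase by hand for `term`."""
    return {
        "bool": {
            "must": {"term": {"clean_artist_name.autocomplete": text.lower()}},
            "should": {"match": {"clean_artist_name": text}},
        }
    }


# Hypothetical artist names, chosen to show the unwanted matches.
artists = ["bill withers", "bill evans", "wham"]
print([a for a in artists if match_hits("bill w", a)])  # all three names
print([a for a in artists if term_hits("bill w", a)])   # only 'bill withers'
```

The dictionary returned by build_autocomplete_query is meant to be placed under "function_score" → "query" exactly as in the query above; the script_score part is unchanged.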

Related

Compare created_time and updated_time in elasticsearch using python

I have tried this query:
body = {
"query": {
"bool": {
"must_not": [{
"match": {
"script": "doc['updated_time'].value == doc['created_time'].value"
}
}]
}
}
}
And my indexed document is:
"hits" : [
{
"_index" : "cam_canvas_update",
"_type" : "_doc",
"_id" : "101",
"_score" : 1.0,
"_source" : {
"created_time" : "2021-08-11T13:44:13.282406282Z",
"updated_time" : "2021-08-11T13:44:13.285397500Z",
"engagement" : "Ford",
"tag_set_2" : "Renew",
"tag_set_3" : "Disputed",
"instance_numbers" : 1,
"canvas_name" : "First",
"recordid" : "ford1",
"pf" : "C6000",
"tag_set_1" : "Sally",
"ldos_date" : "7/7/2018",
"architecture" : "webex"
}
}
]
I want to compare created_time and updated_time for every document and get back only the documents that have been updated, i.e. where the two values differ.
I then want to write only those updated documents to a CSV file.
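You need to put a script query inside a bool filter, as shown below. The "match" query in your attempt treats the script as plain text to search for in a field called "script", so the comparison is never executed: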
{
"query": {
"bool": {
"filter": [{
"script": {
"script": "doc['updated_time'].value != doc['created_time'].value"
}
}]
}
}
}
If you want to ignore sub-second differences, compare the timestamps at whole-second precision instead. Note that the sample document above would then no longer count as updated, because both of its timestamps fall within the same second:
{
"query": {
"bool": {
"filter": [
{
"script": {
"script": {
"source": "doc['updated_time'].value.getMillis()/1000 != doc['created_time'].value.getMillis()/1000",
"lang": "painless"
}
}
}
]
}
}
}
Please let me know if you have any problem with this query.
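For the CSV part of the question, here is a minimal sketch using only the standard library. The function names, the chosen columns and the sample_hits list are assumptions for illustration; in real use the hits would come from the search response (for example via the scan helper in the official elasticsearch Python package), which is not shown here.

```python
import csv
import io


def build_updated_query():
    """Query body from the answer: keep documents whose two timestamps differ."""
    script = "doc['updated_time'].value != doc['created_time'].value"
    return {"query": {"bool": {"filter": [{"script": {"script": script}}]}}}


def write_hits_to_csv(hits, out_file, fieldnames):
    """Write the _source of each search hit as one CSV row; return the row count."""
    # extrasaction="ignore" drops _source keys that are not in fieldnames.
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for hit in hits:
        writer.writerow(hit["_source"])
        count += 1
    return count


# Stand-in for response["hits"]["hits"]; the values are copied from the question.
sample_hits = [
    {
        "_id": "101",
        "_source": {
            "recordid": "ford1",
            "engagement": "Ford",
            "created_time": "2021-08-11T13:44:13.282406282Z",
            "updated_time": "2021-08-11T13:44:13.285397500Z",
        },
    }
]

buffer = io.StringIO()
rows = write_hits_to_csv(sample_hits, buffer, ["recordid", "created_time", "updated_time"])
print(rows)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open("updated.csv", "w", newline="") instead of the in-memory buffer; the file name is just an example.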

Filter using Range-Elasticsearch

The following is the Kibana JSON for a single document:
{
"_index": "questionanswers",
"_type": "doc",
"_id": "3",
"_version": 1,
"_score": 0,
"_source": {
"question": {
"id": 3,
"text": "Your first salary",
"answer_type": "FL",
"question_type": "BQ"
},
"candidate": {
"id": 13
},
"job": {
"id": 6
},
"id": 3,
"status": "AN",
"answered_on": "2019-07-12T09:26:01+00:00",
"answer": "12222222"
},
"fields": {
"answered_on": [
"2019-07-12T09:26:01.000Z"
]
}
}
I have an SQL query like this:
SELECT * FROM questionanswers WHERE question.id = 3 AND answer BETWEEN 1250 AND 1253666
I have converted it to an Elasticsearch query as follows:
{
"size": 1000,
"query": {
"bool": {
"must": [
{
"term": {
"question.id":3
}
},
{
"range": {
"answer": {
"from": 1250,
"to": 1253666999,
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
Here answer is declared as a string, but it holds date, float and string values.
"question": {
"id": 3,
"text": "Your first salary",
"answer_type": "FL",
"question_type": "BQ"
},
Here answer_type tells which type of answer is expected.
When I run this query I do not get the desired results: the response is empty, even though there is a row that satisfies the query.
What should my Elasticsearch query look like so that I can filter on
question.id = 3, question.answer_type = "FL" and answer between 1250 and 1253666?
Look at your document again: answer is stored as a string, but your query treats it as a number. A range query on a string field compares values character by character, so "12222222" sorts before "1250" and falls outside the range.
Change the mapping of this field to a numeric type and reindex.
Below is the document I indexed into a test index; your query works against it unchanged.
Indexing the document (note the answer field, which is now a number):
POST /so-index4/_doc/1
{
"question": {
"id": 3,
"text": "Your first salary",
"answer_type": "FL",
"question_type": "BQ"
},
"candidate": {
"id": 13
},
"job": {
"id": 6
},
"id": 3,
"status": "AN",
"answered_on": "2019-07-12T09:26:01+00:00",
"answer": 12222222,
"fields": {
"answered_on": [
"2019-07-12T09:26:01.000Z"
]
}
}
Then the query (the same one you posted above):
GET /so-index4/_search
{
"size": 1000,
"query": {
"bool": {
"must": [
{
"term": {
"question.id":3
}
},
{
"range": {
"answer": {
"from": 1250,
"to": 1253666999,
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
The result:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 2.0,
"hits" : [
{
"_index" : "so-index4",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.0,
"_source" : {
"question" : {
"id" : 3,
"text" : "Your first salary",
"answer_type" : "FL",
"question_type" : "BQ"
},
"candidate" : {
"id" : 13
},
"job" : {
"id" : 6
},
"id" : 3,
"status" : "AN",
"answered_on" : "2019-07-12T09:26:01+00:00",
"answer" : 12222222,
"fields" : {
"answered_on" : [
"2019-07-12T09:26:01.000Z"
]
}
}
}
]
}
}
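The effect of the string mapping can be reproduced in plain Python, since comparing digit strings there follows the same character-by-character ordering. The sketch below also builds a query that includes the question.answer_type condition from the question; the function names are made up, and the term filter on question.answer_type assumes that field is mapped as keyword (an analysed text field would store "fl" instead of "FL").

```python
def in_range_as_text(value, low, high):
    """Compare the way a string field does: character by character."""
    return str(low) <= str(value) <= str(high)


def in_range_as_number(value, low, high):
    """Compare the way a numeric field does."""
    return low <= float(value) <= high


def build_salary_query(question_id, answer_type, low, high):
    """Bool query with all three conditions from the question."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"question.id": question_id}},
                    # Assumes question.answer_type is mapped as keyword.
                    {"term": {"question.answer_type": answer_type}},
                    # Only meaningful once `answer` has a numeric mapping.
                    {"range": {"answer": {"gte": low, "lte": high}}},
                ]
            }
        }
    }


print(in_range_as_text("12222222", 1250, 1253666999))    # False: "122..." < "125..."
print(in_range_as_number("12222222", 1250, 1253666999))  # True
```

The sketch uses filter instead of must because none of the three conditions needs to contribute to the score; gte and lte are the usual spelling of the from/to/include_lower/include_upper form used above.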

Matching / Mapping lists with elasticsearch

I have a document in MongoDB that contains a list,
e.g.:
db_name = "Test"
collection_name = "Map"
db.Map.findOne()
{
"_id" : ObjectId(...),
"Id" : "576",
"FirstName" : "xyz",
"LastName" : "abc",
"skills" : [
"C++",
"Java",
"Python",
"MongoDB",
]
}
There is a matching list in an Elasticsearch index (I am using Kibana to execute queries):
GET /user/_search
{
"took" : 31,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 1.0,
"hits" : [
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"name" : "xyz abc"
"Age" : 21,
"skills" : [
"C++",
"Java",
"Python",
"MongoDB",
]
}
},
]
}
}
Can anyone help with an Elasticsearch query that matches these two records based on skills?
I am writing the code in Python.
If a match is found, I want to get the first name and last name of that user:
First name : "xyz"
Last name : "abc"
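Assuming you index all of these documents in Elasticsearch and want to match the ones whose skills contain both java and mongodb, the query would be as follows. The lowercase terms assume skills is an analysed text field; for a keyword field, use the exact stored values such as "Java" and "MongoDB":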
{
"query": {
"bool": {
"filter": [
{
"term": {
"skills": "mongodb"
}
},
{
"term": {
"skills": "java"
}
}
]
}
}
}
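Since the question mentions Python, the same query can be generated from the skills list of the MongoDB document. This is a minimal sketch: build_skills_query and mongo_doc are hypothetical names, and lowercasing the terms assumes skills is a text field analysed with the standard analyser. A value such as "C++" would be tokenised differently by that analyser, so it is left out of the sample.

```python
def build_skills_query(skills, field="skills"):
    """One term filter per skill, so a document must contain all of them."""
    return {
        "query": {
            "bool": {
                "filter": [{"term": {field: skill.lower()}} for skill in skills]
            }
        }
    }


# Document shaped like the MongoDB record in the question (skills shortened).
mongo_doc = {"FirstName": "xyz", "LastName": "abc", "skills": ["Java", "MongoDB"]}

body = build_skills_query(mongo_doc["skills"])
print(body)
```

If the search returns a hit, the first and last name can be read from the MongoDB document that supplied the skills (mongo_doc["FirstName"] and mongo_doc["LastName"] here), since the Elasticsearch document only stores a combined name.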

Nothing happens when trying to use $project

I am new to MongoDB and still stuck on the same pipeline. I don't understand why my use of $project does not generate any output at all.
def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {
            "$match": {
                "user.statuses_count": {"$gt": 99},
                "user.time_zone": "Brasilia"
            }
        },
        {
            "$group": {
                "_id": "$user.id",
                "followers": {"$max": "$user.followers_count"}
            }
        },
        {
            "$sort": {"followers": -1}
        },
        {
            "$project": {
                "userId": "$user.id",
                "screen_name": "$user.screen_name",
                "retweet_count": "$retweet_count"
            }
        },
        {
            "$limit": 1
        }
    ]
    return pipeline
Any ideas?
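After $group, each document contains only _id and the accumulator fields you define (here followers), so the $user.id, $user.screen_name and $retweet_count paths in your $project no longer resolve to anything. Carry the original documents through the $group stage and project from them instead; the aggregation pipeline below should give you the desired output.
Using Mongo shell:
Test documents (with minimum test case):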
db.tweet.insert([
{
"retweet_count" : 23,
"user" : {
"time_zone" : "Brasilia",
"statuses_count" : 2475,
"screen_name" : "Catherinemull",
"followers_count" : 169,
"id" : 37486277
},
"id" : NumberLong("22819398300")
},
{
"retweet_count" : 7,
"user" : {
"time_zone" : "Lisbon",
"statuses_count" : 4532,
"screen_name" : "foo",
"followers_count" : 43,
"id" : 37486278
},
"id" : NumberLong("22819398301")
},
{
"retweet_count" : 12,
"user" : {
"time_zone" : "Brasilia",
"statuses_count" : 132,
"screen_name" : "test2",
"followers_count" : 4,
"id" : 37486279
},
"id" : NumberLong("22819398323")
},
{
"retweet_count" : 4235,
"user" : {
"time_zone" : "Brasilia",
"statuses_count" : 33,
"screen_name" : "test4",
"followers_count" : 2,
"id" : 37486280
},
"id" : NumberLong("22819398308")
},
{
"retweet_count" : 562,
"user" : {
"time_zone" : "Kenya",
"statuses_count" : 672,
"screen_name" : "Kiptot",
"followers_count" : 169,
"id" : 37486281
},
"id" : NumberLong("22819398374")
},
{
"retweet_count" : 789,
"user" : {
"time_zone" : "Brasilia",
"statuses_count" : 5263,
"screen_name" : "test231",
"followers_count" : 8282,
"id" : 37486
},
"id" : NumberLong("22819398331")
}
]);
The Magic:
db.tweet.aggregate([
{
'$match': {
"user.statuses_count": {"$gt":99 },
"user.time_zone": "Brasilia"
}
},
{
"$group": {
"_id": "$user.id",
"followers": { "$max": "$user.followers_count" },
"doc": {
"$addToSet": "$$ROOT"
}
}
},
{
"$sort": { "followers": -1 }
},
{
"$unwind": "$doc"
},
{
"$project": {
"_id": 0,
"userId": "$_id",
"screen_name": "$doc.user.screen_name",
"retweet_count": "$doc.retweet_count",
"followers": 1
}
},
{
"$limit": 1
}
]);
Output:
/* 1 */
{
"result" : [
{
"userId" : 37486,
"screen_name" : "test231",
"retweet_count" : 789,
"followers" : 8282
}
],
"ok" : 1
}
-- UPDATE --
Python implementation:
>>> from bson.son import SON
>>> pipeline = [
... {"$match": {"user.statuses_count": {"$gt": 99}, "user.time_zone": "Brasilia"}},
... {"$group": {"_id": "$user.id", "followers": { "$max": "$user.followers_count" }, "doc": {"$addToSet": "$$ROOT"}}},
... {"$sort": {"followers": -1 }},
... {"$unwind": "$doc"}, {"$project": {"_id": 0, "userId": "$_id", "screen_name": "$doc.user.screen_name", "retweet_count": "$doc.retweet_count", "followers": 1}},
... {"$limit": 1}
... ]
>>> list(db.tweet.aggregate(pipeline))
[{u'userId': 37486, u'screen_name': u'test231', u'retweet_count': 789, u'followers': 8282}]
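The same pipeline can be wrapped in the make_pipeline() function from the question. The sketch below also includes a small checker, unresolved_paths, which is a made-up helper that only looks at top-level field names and ignores every stage other than $group and $project; it is meant to illustrate why the original $project came back empty, not to validate pipelines in general.

```python
def make_pipeline():
    """Return the corrected pipeline in the shape the question's function expects."""
    return [
        {"$match": {"user.statuses_count": {"$gt": 99},
                    "user.time_zone": "Brasilia"}},
        # $group keeps only _id and accumulators, so carry the full documents along.
        {"$group": {"_id": "$user.id",
                    "followers": {"$max": "$user.followers_count"},
                    "doc": {"$addToSet": "$$ROOT"}}},
        {"$sort": {"followers": -1}},
        {"$unwind": "$doc"},
        {"$project": {"_id": 0,
                      "userId": "$_id",
                      "screen_name": "$doc.user.screen_name",
                      "retweet_count": "$doc.retweet_count",
                      "followers": 1}},
        {"$limit": 1},
    ]


def unresolved_paths(pipeline):
    """List $project paths whose top-level field the preceding $group does not output."""
    available = None
    missing = []
    for stage in pipeline:
        if "$group" in stage:
            available = set(stage["$group"])
        elif "$project" in stage and available is not None:
            for value in stage["$project"].values():
                if isinstance(value, str) and value.startswith("$"):
                    if value[1:].split(".")[0] not in available:
                        missing.append(value)
    return missing


# The $group and $project stages from the question, for comparison.
original_pipeline = [
    {"$group": {"_id": "$user.id",
                "followers": {"$max": "$user.followers_count"}}},
    {"$project": {"userId": "$user.id",
                  "screen_name": "$user.screen_name",
                  "retweet_count": "$retweet_count"}},
]

print(unresolved_paths(original_pipeline))  # every projected path is missing
print(unresolved_paths(make_pipeline()))    # nothing is missing
```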

pymongo get id for collection

I have this code:
def get_attribute_colour(colour_code):
    attribute_colour_meta = db.attributes.aggregate([
        {'$match': {"name.en-UK": "Colour"}},
        {'$unwind': "$values"},
        {'$project': {"code": "$values.code", "valueId": "$values._id"}},
        {'$match': {"code": colour_code}},
    ])
    # PyMongo 2.x returns a dict here; PyMongo 3+ returns a cursor instead.
    return attribute_colour_meta['result']
It looks up a collection called attributes, which has the following structure:
> db.attributes.find({}).pretty();
{
"_id" : ObjectId("53b27bded901f26432996e00"),
"values" : [
{
"code" : "AQ",
"pmsCode" : "638c",
"name" : {
"en-UK" : "Aqua"
},
"tcxCode" : "16-4529 TCX",
"hexCode" : "#00aed8",
"images" : [
"AQ.jpg"
],
"_id" : ObjectId("53b27bded901f26432996d83")
},
{
"code" : "AQ",
"pmsCode" : "3115c",
"name" : {
"en-UK" : "Aqua"
},
"tcxCode" : "",
"hexCode" : "#00c4db",
"images" : [
"AQ.jpg"
],
"_id" : ObjectId("53b27bded901f26432996d84")
},
.....
}
],
"name" : {
"en-UK" : "Colour"
}
}
{
"_id" : ObjectId("53b27bded901f26432996e1b"),
"values" : [
{
"code" : 0,
"_id" : ObjectId("53b27bded901f26432996e01"),
"name" : {
"en-UK" : "0-3 MTHS"
}
},
.....
}
],
"name" : {
"en-UK" : "Size"
}
}
{
"_id" : ObjectId("53b27bded901f26432996e28"),
"values" : [
{
"Currency" : "GBP",
"_id" : ObjectId("53b27bded901f26432996e1c"),
"name" : {
"en-UK" : "Carton price list"
}
},
}
],
"name" : {
"en-UK" : "Price list"
}
}
>
Basically, there are 3 attributes (colour, size and price list), each of which has sub-documents called values.
In my get_attribute_colour function, how do I return the _id of the attribute within the results, so that I get something like:
{ attributeId: ObjectId("53b27bded901f26432996e00"),
valueId: ObjectId("53b27bded901f26432996d83") }
The result does return the _id:
[{u'code': u'AQ', u'_id': ObjectId('53b27bded901f26432996e00'), u'valueId': ObjectId('53b27bded901f26432996d83')}]
but I don't see where this is specified.
Any advice much appreciated.
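One likely explanation, offered as a sketch rather than a definitive answer: $project passes _id through unless it is explicitly excluded, which is why it appears in the result without being listed. To get it under the name attributeId, it can be projected under that name and the implicit copy switched off. The function name colour_pipeline is made up, and the $match on the colour code has been moved in front of $project so that code no longer needs to appear in the output.

```python
def colour_pipeline(colour_code):
    """Pipeline from the question, with _id exposed as attributeId."""
    return [
        {"$match": {"name.en-UK": "Colour"}},
        {"$unwind": "$values"},
        # Filter on the unwound sub-document before projecting.
        {"$match": {"values.code": colour_code}},
        {"$project": {
            "_id": 0,                  # switch off the implicit _id
            "attributeId": "$_id",     # expose it under the wanted name
            "valueId": "$values._id",
        }},
    ]


pipeline = colour_pipeline("AQ")
print(pipeline[-1])
```

With PyMongo 2.x the result would be read as db.attributes.aggregate(pipeline)['result'], as in the question; with PyMongo 3 or later, aggregate returns a cursor, so list(db.attributes.aggregate(pipeline)) is the equivalent.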
