I use the phonetic analysis plugin from Elasticsearch to do some string matching based on phonetic transformation.
My problem is: how do I get the phonetic transformation computed by Elasticsearch in the result of the query?
First, I create an index with a metaphone transformation:
request_body = {
    'settings': {
        'index': {
            'analysis': {
                'analyzer': {
                    'metaphone_analyzer': {
                        'tokenizer': 'standard',
                        'filter': [
                            'ascii_folding_filter', 'lowercase',
                            'metaphone_filter'
                        ]
                    }
                },
                'filter': {
                    'metaphone_filter': {
                        'type': 'phonetic',
                        'encoder': 'metaphone',
                        'replace': False
                    },
                    'ascii_folding_filter': {
                        'type': 'asciifolding',
                        'preserve_original': True
                    }
                }
            }
        }
    },
    'mappings': {
        'person_name': {
            'properties': {
                'full_name': {
                    'type': 'text',
                    'fields': {
                        'metaphone_field': {
                            'type': 'text',
                            'analyzer': 'metaphone_analyzer'
                        }
                    }
                }
            }
        }
    }
}
res = es.indices.create(index="my_index", body=request_body)
Then, I add some data:
# Add some data
names = [{
    "full_name": "John Doe"
}, {
    "full_name": "Bob Alice"
}, {
    "full_name": "Foo Bar"
}]
for name in names:
    res = es.index(index="my_index",
                   doc_type='person_name',
                   body=name,
                   refresh=True)
And finally, I query a name:
es.search(index="my_index",
          body={
              "size": 5,
              "query": {
                  "multi_match": {
                      "query": "Jon Doe",
                      "fields": "*_field"
                  }
              }
          })
Search returns:
{
    'took': 1,
    'timed_out': False,
    '_shards': {
        'total': 5,
        'successful': 5,
        'skipped': 0,
        'failed': 0
    },
    'hits': {
        'total': 1,
        'max_score': 0.77749264,
        'hits': [{
            '_index': 'my_index',
            '_type': 'person_name',
            '_id': 'AWwYjl4Mqo63y_hLp5Yl',
            '_score': 0.77749264,
            '_source': {
                'full_name': 'John Doe'
            }
        }]
    }
}
In the search response I would like to get the phonetic transformation of the names stored in Elasticsearch (and ideally of the query string too, but that is less important) when I execute the search.
I know I could use the explain API, but I would like to avoid a second request, and the explain API seems a little "overkill" for what I want to achieve.
Thanks!
It doesn't look like an easy thing to implement in a single Elasticsearch query, but you could try the analyze API, scripted fields with fielddata enabled, and term vectors, which might come in handy. Here's how.
Retrieve tokens from an arbitrary query
The analyze API is a great tool if you want to understand how exactly Elasticsearch tokenizes your query.
Using your mapping you could do, for example:
GET myindex/_analyze
{
"analyzer": "metaphone_analyzer",
"text": "John Doe"
}
And get something like this as a result:
{
"tokens": [
{
"token": "JN",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "john",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "T",
"start_offset": 5,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "doe",
"start_offset": 5,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
}
]
}
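The same call from the Python client used in the question might look like this (a sketch; `es` is the client instance and `my_index` the index created above):

# Ask the index's metaphone analyzer how it tokenizes an arbitrary string.
res = es.indices.analyze(index="my_index",
                         body={
                             "analyzer": "metaphone_analyzer",
                             "text": "John Doe"
                         })
print([t["token"] for t in res["tokens"]])  # e.g. ['JN', 'john', 'T', 'doe']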
This is technically a different query, but still might be useful.
Retrieve tokens from a field of a document
In theory, we could try to retrieve the very same tokens which the analyze API returned in the previous section from the documents matched by our query.
In practice, Elasticsearch will not store the tokens of a text field it has just analyzed: fielddata is disabled by default. We need to enable it:
PUT /myindex
{
"mappings": {
"person_name": {
"properties": {
"full_name": {
"fields": {
"metaphone_field": {
"type": "text",
"analyzer": "metaphone_analyzer",
"fielddata": true
}
},
"type": "text"
}
}
}
},
"settings": {
...
}
}
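Fielddata is one of the few mapping settings that can be flipped on an existing index without reindexing, so if the index is already populated, something like this should work from the Python client (a hedged sketch reusing the question's index and field names):

# Update the existing mapping in place to enable fielddata on the sub-field.
es.indices.put_mapping(index="my_index",
                       doc_type="person_name",
                       body={
                           "properties": {
                               "full_name": {
                                   "type": "text",
                                   "fields": {
                                       "metaphone_field": {
                                           "type": "text",
                                           "analyzer": "metaphone_analyzer",
                                           "fielddata": True
                                       }
                                   }
                               }
                           }
                       })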
Now, we can use scripted fields to ask Elasticsearch to return those tokens.
The query might look like this:
POST myindex/_search
{
"script_fields": {
"my tokens": {
"script": {
"lang": "painless",
"source": "doc[params.field].values",
"params": {
"field": "full_name.metaphone_field"
}
}
}
}
}
And the response would look like this:
{
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "myindex",
"_type": "person_name",
"_id": "123",
"_score": 1,
"fields": {
"my tokens": [
"JN",
"T",
"doe",
"john"
]
}
}
]
}
}
As you can see, we get the very same tokens, though not in their original order.
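Also worth noting: script_fields is just another section of the search body, so it can be combined with the original multi_match query, returning the tokens alongside the normal hits in a single request, which is what the question is after. A sketch in the question's Python style (keep in mind that script_fields suppresses _source unless it is requested explicitly):

res = es.search(index="my_index",
                body={
                    "size": 5,
                    "query": {
                        "multi_match": {
                            "query": "Jon Doe",
                            "fields": "*_field"
                        }
                    },
                    "_source": True,  # script_fields alone would hide _source
                    "script_fields": {
                        "my tokens": {
                            "script": {
                                "lang": "painless",
                                "source": "doc[params.field].values",
                                "params": {
                                    "field": "full_name.metaphone_field"
                                }
                            }
                        }
                    }
                })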
Can we also retrieve information about the location of these tokens in the document?
Retrieving tokens with their positions
Term vectors may help. To use them we don't actually need fielddata enabled. We can look up the term vectors for a document:
GET myindex/person_name/123/_termvectors
{
"fields" : ["full_name.metaphone_field"],
"offsets" : true,
"positions" : true
}
This would return something like this:
{
"_index": "myindex",
"_type": "person_name",
"_id": "123",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"full_name.metaphone_field": {
"field_statistics": {
"sum_doc_freq": 4,
"doc_count": 1,
"sum_ttf": 4
},
"terms": {
"JN": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 4
}
]
},
"T": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 5,
"end_offset": 8
}
]
},
"doe": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 5,
"end_offset": 8
}
]
},
"john": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 4
}
]
}
}
}
}
}
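The equivalent call through the Python client might be (a sketch; the document id is whatever es.index returned, here the id from the sample search response earlier):

tv = es.termvectors(index="my_index",
                    doc_type="person_name",
                    id="AWwYjl4Mqo63y_hLp5Yl",
                    fields=["full_name.metaphone_field"],
                    offsets=True,
                    positions=True)
print(tv["term_vectors"]["full_name.metaphone_field"]["terms"])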
This gives us a way to get the tokens of a document field exactly as the analyzer produced them.
Unfortunately, to my knowledge, there is no way to combine these three requests into a single one. Also, fielddata should be used with caution since it can consume a lot of memory.
Hope this helps!
Related
I have 10k+ records in Elasticsearch. One of the fields (dept) holds data in the form of an array, e.g. the records are:
{
    "username": "tom",
    "dept": [
        "cust_service",
        "sales_rpr",
        "store_in"
    ],
    "location": "NY"
}
{
    "username": "adam",
    "dept": [
        "cust_opr",
        "floor_in",
        "mg_cust_opr"
    ],
    "location": "MA"
}
.
.
.
I want to do autocomplete on the dept field. If the user searches for cus, it should return
["cust_service", "cust_opr", "mg_cust_opr"]
with the best match at the top.
I have made this query:
query = {
    "_source": [],
    "size": 0,
    "min_score": 0.5,
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "dept": {
                            "value": "*cus*"
                        }
                    }
                }
            ],
            "filter": [],
            "should": [],
            "must_not": []
        }
    },
    "aggs": {
        "auto_complete": {
            "terms": {
                "field": "dept.raw",
                "size": 20,
                "order": {"max_score": "desc"}
            },
            "aggs": {
                "max_score": {
                    "avg": {"script": "_score"}
                }
            }
        }
    }
}
It is not giving ["cust_service", "cust_opr", "mg_cust_opr"]; instead it gives other results that are irrelevant to the search key (cus). But when the field is just a string instead of an array, it gives the expected result.
How do I solve this problem?
Thanks in advance!
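Not from the original thread, but a likely cause worth illustrating: the terms aggregation counts every dept value of every document matched by the wildcard query, so for array fields the non-matching array elements become buckets too. The terms aggregation's include parameter can restrict the buckets themselves with a regex; a minimal sketch, reusing the field names from the question:

query = {
    "size": 0,
    "query": {
        "wildcard": {"dept": {"value": "*cus*"}}
    },
    "aggs": {
        "auto_complete": {
            "terms": {
                "field": "dept.raw",
                # the regex is applied to the candidate bucket keys themselves,
                # so array elements that don't contain "cus" are dropped
                "include": ".*cus.*",
                "size": 20
            }
        }
    }
}

Relevance ordering across buckets is still a separate concern, but this removes the irrelevant entries.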
I'm currently trying to do a hybrid search over two fields of the same index: a full-text field and a knn_vector (word embeddings) field. Currently, over 10,000 documents from Wikipedia are indexed on an ES stack, indexed on both of these fields (see mapping: "content", "embeddings"). The queries are well-known n-grams (1, 2, 3) that should yield results (the words are taken from the indexed Wikipedia pages).
It is also important to note that the knn_vector field is defined inside a nested object.
This is the current mapping of the items indexed:
mapping = {
    "settings": {
        "index": {
            "knn": True,
            "knn.space_type": "cosinesimil"
        }
    },
    "mappings": {
        "dynamic": 'strict',
        "properties": {
            "elasticId": {'type': 'text'},
            "owners": {'type': 'text'},
            "type": {'type': 'keyword'},
            "accessLink": {'type': 'keyword'},
            "content": {'type': 'text'},
            "embeddings": {
                'type': 'nested',
                "properties": {
                    "vector": {
                        "type": "knn_vector",
                        "dimension": VECTOR_DIM,
                    },
                },
            },
        }
    }
}
My goal is to compare the query scores on both fields to understand whether one is more efficient than the other (full text vs. knn_vectors), and how Elasticsearch chooses which object to return based on the score from each field.
I understand I could simply split the queries (two separate queries), but ideally, we might want to use a hybrid search of this type in production.
This is the current query that searches on both full text and the knn_vectors:
def MakeHybridSearch(query):
    query_vector = convert_to_embeddings(query)
    result = elastic.search({
        "explain": True,
        "profile": True,
        "size": 2,
        "query": {
            "function_score": {
                "functions": [
                    {
                        "filter": {
                            "match": {
                                "text": {
                                    "query": query,
                                    'boost': "5",
                                },
                            },
                        },
                        "weight": 2
                    },
                    {
                        "filter": {
                            'script': {
                                'source': 'knn_score',
                                'params': {
                                    'field': 'doc_vector',
                                    'vector': query_vector,
                                    'space_type': "l2"
                                }
                            }
                        },
                        "weight": 4
                    }
                ],
                "max_boost": 5,
                "score_mode": "replace",
                "boost_mode": "multiply",
                "min_score": 5
            }
        }
    }, index='files_en', size=1000)
The current problem is that the queries are not returning anything.
Result:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
    }
}
Even when the query does return a response, it returns hits with a score of 0 (score = 0).
Is there an error in the query structure? Could this be on the mapping side? If not, is there a better way of doing this?
Thank you for your help!
I've seen issues similar to this when a solution doesn't allow for prefiltering prior to running k-NN and instead requires k-NN to run before the filters. This often leads to empty result sets, as none of the hits from the initial k-NN pass align with the subsequent filters.
Here is an option that might help as it allows for prefiltering prior to k-NN: https://medium.com/gsi-technology/scalable-semantic-vector-search-with-elasticsearch-e79f9145ba8e
Now AWS Elasticsearch supports post-filtering:
https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html
The sample query on that page is:
{
"size": 2,
"query": {
"knn": {
"my_vector2": {"vector": [2, 3, 5, 6], "k": 2}
}
},
"post_filter": {
"range": {
"price": {"gte": 6, "lte": 10}
}
}
}
AWS Elasticsearch also supports pre-filtering using 'script_score' (but it is different from 'script_score' in stock Elasticsearch):
https://opendistro.github.io/for-elasticsearch-docs/docs/knn/knn-score-script/
The sample query on that page is:
{
"size": 2,
"query": {
"script_score": {
"query": {
"bool": {
"filter": {
"term": {
"color": "BLUE"
}
}
}
},
"script": {
"lang": "knn",
"source": "knn_score",
"params": {
"field": "my_binary",
"query_value": "U29tZXRoaW5nIEltIGxvb2tpbmcgZm9y",
"space_type": "hammingbit"
}
}
}
}
}
I have a document that references another document, and I'd like to join these documents and filter based on the contents of an array in the child document:
deployment_machine document:
{
"_id": 1,
"name": "Test Machine",
"machine_status": 10,
"active": true
}
machine_status document:
{
"_id": 10,
"breakdown": [
{
"status_name": "Rollout",
"state": "complete"
},
{
"status_name": "Deploying",
"state": "complete"
}
]
}
I'm using Mongo 3.6 and am having mixed success with $lookup and its pipeline. Here's the object I'm using in Python MongoEngine, passed to the aggregate function:
pipeline = [
    {'$match': {'breakdown': {'$elemMatch': {'status_name': 'Rollout'}}}},
    {'$lookup': {
        'from': 'deployment_machine',
        'let': {'status_id': '$_id'},
        'pipeline': [
            {'$match': {
                '$expr': {
                    '$and': [
                        {'$eq': ['$machine_status', '$$status_id']},
                    ]
                },
            }}
        ],
        'as': 'result',
    }},
    {'$project': {
        'breakdown': {'$filter': {
            'input': '$breakdown',
            'as': 'breakdown',
            'cond': {'$eq': ['$$breakdown.status_name', 'Rollout']}
        }}
    }},
]
result = list(MachineStatus.objects.aggregate(*pipeline))
This works well, but how can I exclude results where the Deployment Machine isn't active? I feel it must go in the $project stage, but I can't find a condition that works. Any help appreciated.
You can add more conditions in the $lookup pipeline. To keep only active machines, match on active: true:
pipeline = [
    { $match: { breakdown: { $elemMatch: { status_name: "Rollout" } } } },
    {
        $lookup: {
            from: "deployment_machine",
            let: { status_id: "$_id" },
            pipeline: [
                {
                    $match: {
                        $expr: { $eq: ["$machine_status", "$$status_id"] },
                        active: true
                    }
                }
            ],
            as: "result",
        }
    },
    {
        $project: {
            breakdown: {
                $filter: {
                    input: "$breakdown",
                    as: "breakdown",
                    cond: { $eq: ["$$breakdown.status_name", "Rollout"] },
                }
            }
        }
    }
];
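Since the question drives this through MongoEngine from Python, the same extra condition in Python dict form might look like this (a sketch reusing the question's names):

{'$lookup': {
    'from': 'deployment_machine',
    'let': {'status_id': '$_id'},
    'pipeline': [
        {'$match': {
            '$expr': {'$eq': ['$machine_status', '$$status_id']},
            'active': True,  # keep only active deployment machines
        }}
    ],
    'as': 'result',
}}

If status documents whose result array comes back empty should be dropped entirely, a trailing {'$match': {'result': {'$ne': []}}} stage after the $lookup would do it.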
I have a MongoDB document structure like the following:
Structure
{
"stores": [
{
"items": [
{
"feedback": [],
"item_category": "101",
"item_id": "10"
},
{
"feedback": [],
"item_category": "101",
"item_id": "11"
}
]
},
{
"items": [
{
"feedback": [],
"item_category": "101",
"item_id": "10"
},
{
"feedback": ["A feedback"],
"item_category": "101",
"item_id": "11"
},
{
"feedback": [],
"item_category": "101",
"item_id": "12"
},
{
"feedback": [],
"item_category": "102",
"item_id": "13"
},
{
"feedback": [],
"item_category": "102",
"item_id": "14"
}
],
"store_id": 500
}
]
}
This is a single document in the collection. Some fields were deleted to produce a minimal representation of the data.
What I want is to get items only if the feedback field in the items array is not empty. The expected result is:
Expected result
{
"stores": [
{
"items": [
{
"feedback": ["A feedback"],
"item_category": "101",
"item_id": "11"
}
],
"store_id": 500
}
]
}
This is what I tried, based on examples in this (which I think is pretty much the same situation), but it didn't work. What's wrong with my query? Isn't it the same situation as the zipcode search example in the link? It returns everything, like in the first JSON block, Structure:
What I tried
query = {
    'date': {'$gte': since, '$lte': until},
    'stores.items': {"$elemMatch": {"feedback": {"$ne": []}}}
}
Thanks.
Please try this:
db.yourCollectionName.aggregate([
{ $match: { 'date': { '$gte': since, '$lte': until }, 'stores.items': { "$elemMatch": { "feedback": { "$ne": [] } } } } },
{ $unwind: '$stores' },
{ $match: { 'stores.items': { "$elemMatch": { "feedback": { "$ne": [] } } } } },
{ $unwind: '$stores.items' },
{ $match: { 'stores.items.feedback': { "$ne": [] } } },
{ $group: { _id: { _id: '$_id', store_id: '$stores.store_id' }, items: { $push: '$stores.items' } } },
{ $project: { _id: '$_id._id', store_id: '$_id.store_id', items: 1 } },
{ $group: { _id: '$_id', stores: { $push: '$$ROOT' } } },
{ $project: { 'stores._id': 0 } }
])
We need all these stages because you're operating on an array of arrays. This query is written assuming you're dealing with a large data set; since you're filtering on dates, if your document count is much smaller after the first $match, you can drop the following $match stage that sits between the two $unwind's.
Refs:
$match,
$unwind,
$project,
$group
This aggregate query gets the needed result (using the provided sample document and run from the mongo shell):
db.stores.aggregate( [
{ $unwind: "$stores" },
{ $unwind: "$stores.items" },
{ $addFields: { feedbackExists: { $gt: [ { $size: "$stores.items.feedback" }, 0 ] } } },
{ $match: { feedbackExists: true } },
{ $project: { _id: 0, feedbackExists: 0 } }
] )
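For completeness, the same pipeline run from Python with PyMongo rather than the mongo shell (the connection details and database name are assumptions):

from pymongo import MongoClient

client = MongoClient()            # assumes a local mongod; adjust the URI as needed
stores = client["test"]["stores"]  # database name assumed, collection as in the answer

pipeline = [
    {"$unwind": "$stores"},
    {"$unwind": "$stores.items"},
    # flag items whose feedback array is non-empty, then keep only those
    {"$addFields": {"feedbackExists": {"$gt": [{"$size": "$stores.items.feedback"}, 0]}}},
    {"$match": {"feedbackExists": True}},
    {"$project": {"_id": 0, "feedbackExists": 0}},
]
for doc in stores.aggregate(pipeline):
    print(doc)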
I have a collection with fields like this:
{
"_id":"5cf54857bbc85fd0ff5640ba",
"book_id":"5cf172220fb516f706d00591",
"tags":{
"person":[
{"start_match":209, "length_match":6, "word":"kimmel"}
],
"organization":[
{"start_match":107, "length_match":12, "word":"philadelphia"},
{"start_match":209, "length_match":13, "word":"kimmel center"}
],
"location":[
{"start_match":107, "length_match":12, "word":"philadelphia"}
]
},
"deleted":false
}
I want to collect the distinct words in each category and count them.
So, the output should be like this:
{
"response": [
{
"tag": "location",
"tag_list": [
{
"count": 31,
"phrase": "philadelphia"
},
{
"count": 15,
"phrase": "usa"
}
]
},
{
"tag": "organization",
"tag_list": [ ... ]
},
{
"tag": "person",
"tag_list": [ ... ]
},
]
}
A pipeline like this works:
def pipeline_func(tag):
    return [
        {'$replaceRoot': {'newRoot': '$tags'}},
        {'$unwind': '${}'.format(tag)},
        {'$group': {'_id': '${}.word'.format(tag), 'count': {'$sum': 1}}},
        {'$project': {'phrase': '$_id', 'count': 1, '_id': 0}},
        {'$sort': {'count': -1}}
    ]
But it makes a request for each tag. I want to know how to do it in one request.
Thank you for your attention.
As noted, there is a slight mismatch between the question's data and the claimed pipeline, since $unwind can only be used on arrays, and tags as presented in the question is not an array.
For the data presented in the question you basically want a pipeline like this:
db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
Again, as per the note: since tags is in fact an object, what you actually need in order to collect data based on its sub-keys, as the question asks, is to essentially turn it into an array of items.
The usage of $replaceRoot in your current pipeline suggests $objectToArray is fair game here, as it is available from later patch releases of MongoDB 3.4, the bare minimum version you should be using in production right now.
That $objectToArray actually does pretty much what the name says and produces an array (or "list", to be more Pythonic) of entries broken into key and value pairs. These are essentially a "list" of objects (or "dict" entries) which have the keys k and v respectively. The output of the first pipeline stage would look like this on the supplied document:
{
"book_id": "5cf172220fb516f706d00591",
"tags": [
{
"k": "person",
"v": [
{
"start_match": 209,
"length_match": 6,
"word": "kimmel"
}
]
}, {
"k": "organization",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}, {
"start_match": 209,
"length_match": 13,
"word": "kimmel center"
}
]
}, {
"k": "location",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}
]
}
],
"deleted" : false
}
So you should be able to see how you can now easily access those k values and use them in grouping, and of course v is a standard array as well. It's just the two $unwind stages as shown and then two $group stages: the first $group collects over the combination of keys, and the second collects per the main grouping key whilst pushing the other accumulations into a "list" within that entry.
Of course the output of the above listing is not exactly what you asked for in the question, but the data is basically there. You can optionally add an $addFields or $project stage as the final aggregation stage to essentially rename the _id key:
{ "$addFields": {
"_id": "$$REMOVE",
"tag": "$_id"
}}
Or simply do something Pythonic with a little list comprehension on the cursor output:
cursor = db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
output = [{ 'tag': doc['_id'], 'tag_list': doc['tag_list'] } for doc in cursor]
print({ 'response': output });
And the final output, as a "list" you can use for the response:
{
"tag_list": [
{
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "location"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel"
}
],
"tag": "person"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel center"
}, {
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "organization"
}
Note that with the list comprehension approach you have a bit more control over the order of "keys" in the output, as MongoDB itself would simply append NEW key names in a projection, keeping existing keys ordered first. That's if this sort of thing is important to you; it really should not be, since Object/Dict-like structures should not be considered to have any set order of keys. That's what arrays (or lists) are for.