How to query all values of a nested field with Elasticsearch - python

I would like to query a value across all the documents I have in Elasticsearch.
For example, I have a document like this:
"website" : "google",
"color" : [
{
"color1" : "red",
"color2" : "blue"
}
]
}
I have documents like this for an unknown number of websites. I want to extract all the "color1" values for all the websites I have. How can I do that? I tried match_all with "size": 0 but it didn't work.
Thanks a lot!

To be able to query a nested object you need to map it as a nested field first.
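A minimal sketch of creating such a mapping with the Python client (the index name, field types, client endpoint and the 8.x elasticsearch-py client are assumptions, not taken from the question):

from elasticsearch import Elasticsearch

# Sketch only: the endpoint and index name are assumptions.
es = Elasticsearch("http://localhost:9200")

# "color" must be mapped as type "nested", otherwise the nested aggregation
# below will not treat the objects inside the array as separate nested documents.
es.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "website": {"type": "keyword"},
            "color": {
                "type": "nested",
                "properties": {
                    "color1": {"type": "keyword"},
                    "color2": {"type": "keyword"}
                }
            }
        }
    },
)

With that mapping in place, you can query the nested field like this: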
GET /my-index-000001/_search
{
"aggs": {
"test": {
"nested": {
"path": "color"
},
"aggs": {
"test2": {
"terms": {
"field": "color.color1"
}
}
}
}
}
}
Your result should look like this for the query:
"aggregations": {
"test": {
"doc_count": 5,
"test2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "red",
"doc_count": 4
},
{
"key": "gray",
"doc_count": 1
}
]
}
}
}
If you check the aggregation result, you will have the list of your color1 values along with the number of times each one appeared in your documents.
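Since the question is tagged python, the same request and the extraction of the color1 values might look roughly like this with the elasticsearch-py client (a sketch, reusing the client and index name assumed above; note that a terms aggregation returns at most 10 buckets by default, so raise its size if you expect more distinct values):

# Sketch: assumes the same "es" client and index name as in the mapping sketch above.
resp = es.search(
    index="my-index-000001",
    size=0,  # only the aggregation is needed, not the hits
    aggs={
        "test": {
            "nested": {"path": "color"},
            "aggs": {
                "test2": {"terms": {"field": "color.color1"}}
            }
        }
    },
)

# One bucket per distinct color1 value, with the number of documents it appears in.
for bucket in resp["aggregations"]["test"]["test2"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])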
For more information you can check the official Elasticsearch documentation on the nested field type and the nested aggregation.

Related

ElasticSearch: Retrieve field and its normalization

I want to retrieve a field as well as its normalized version from Elasticsearch.
Here's my index definition and data
PUT normalizersample
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"refresh_interval": "60s",
"analysis": {
"normalizer": {
"my_normalizer": {
"filter": [
"lowercase",
"german_normalization",
"asciifolding"
],
"type": "custom"
}
}
}
},
"mappings": {
"_source": {
"enabled": true
},
"properties": {
"myField": {
"type": "text",
"store": true,
"fields": {
"keyword": {
"type": "keyword",
"store": true
},
"normalized": {
"type": "keyword",
"store": true,
"normalizer": "my_normalizer"
}
}
}
}
}
}
POST normalizersample/_doc/1
{
"myField": ["Andreas", "Ämdreas", "Anders"]
}
My first approach was to use scripted fields like
GET /myIndex/_search
{
"size": 100,
"query": {
"match_all": {}
},
"script_fields": {
"keyword": {
"script": "doc['myField.keyword']"
},
"normalized": {
"script": "doc['myField.normalized']"
}
}
}
However, since myField is an array, this returns two lists of strings per ES document, and each of them is sorted alphabetically. Hence, the corresponding entries might not match each other because of the normalization.
"hits" : [
{
"_index" : "normalizersample",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"de" : [
"amdreas",
"anders",
"andreas"
],
"keyword" : [
"Anders",
"Andreas",
"Ämdreas"
]
}
}
]
Instead, I would like to retrieve [(Andreas, andreas), (Ämdreas, amdreas), (Anders, anders)] or a similar format where I can match every entry to its normalization.
The only way I found was to call Term Vectors on both fields since they contain a position field, but this seems like a huge overhead to me. (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html)
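For reference, the term vectors call described above might look roughly like this with the Python client (a sketch only; it assumes the elasticsearch-py 8.x client and uses the index and document id from the example above):

from elasticsearch import Elasticsearch

# Sketch only: the endpoint is an assumption.
es = Elasticsearch("http://localhost:9200")

tv = es.termvectors(
    index="normalizersample",
    id="1",
    fields=["myField.keyword", "myField.normalized"],
    positions=True,            # positions are what would let the two fields be lined up
    term_statistics=False,
    field_statistics=False,
)

# tv["term_vectors"]["<field>"]["terms"] maps each term to its occurrence info,
# including positions that could be matched across the two sub-fields.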
Is there a simpler way to retrieve tuples with the keyword and the normalized field?
Thanks a lot!

Getting first entry of each unique combination of fields from elasticsearch

I have an elasticsearch database to which I upload entries like this:
{"s_food": "bread", "s_store": "Safeway", "s_date" : "2020-06-30", "l_run" : 28900, "l_covered": 1}
When I upload it to elasticsearch, it adds _id, _type, #timestamp and _index fields. So the entries look sort of like this:
{"s_food": "bread", "s_store": "Safeway", "s_date" : "2020-06-30", "l_run" : 28900, "l_covered": 1, "_type": "_doc", "_index": "my_index", "_id": pe39u5hs874kee}
The way that I'm using the elasticsearch database results in the same original entries getting uploaded multiple times. In this example, I only care about the s_food, s_date, and l_run fields being a unique combination. Since I have so many entries, I'd like to use the elasticsearch scroll tool to go through all the matches.
So far in elasticsearch, I've only seen people use aggregation to get buckets of each term and then they iterate over each partition. I would like to use something like aggregation to get an entire entry (just 1) for each unique combination of the three fields that I care about (food, date, run). Right now I use aggregation with a scroll like so:
GET /my-index/_search?scroll=25m
{
"size": 10000,
"aggs": {
"foods": {
"terms": {
"field": "s_food"
},
"aggs": {
"dates": {
"terms": {
"field": "s_date"
},
"aggs": {
"runs": {
"terms": {
"field": "l_run"
}
}
}
}
}
}
}
}
Unfortunately this is only giving me the usual bucketed structure that I don't want. Is there something else I should try?
All you need is to use the top_hits aggregation with size: 1. You can read more about the top_hits aggregation in the Elasticsearch documentation.
The query would look like this:
{
"size": 10000,
"aggs": {
"foods": {
"terms": {
"field": "s_food"
},
"aggs": {
"dates": {
"terms": {
"field": "s_date"
},
"aggs": {
"runs": {
"terms": {
"field": "l_run"
},
"aggs": {
"topOne": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
}
}
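Reading one document per (s_food, s_date, l_run) combination back out of the nested buckets could then look roughly like this in Python (a sketch; resp is assumed to be the result of running the query above with an elasticsearch-py client, and each terms aggregation returns at most 10 buckets by default, so you may need to raise their size):

# Sketch: "resp" is assumed to hold the response of the aggregation query above,
# e.g. resp = es.search(index="my-index", body=query) with an elasticsearch-py client.
unique_entries = []
for food_bucket in resp["aggregations"]["foods"]["buckets"]:
    for date_bucket in food_bucket["dates"]["buckets"]:
        for run_bucket in date_bucket["runs"]["buckets"]:
            # top_hits with size 1 leaves exactly one hit per (food, date, run) bucket
            hit = run_bucket["topOne"]["hits"]["hits"][0]
            unique_entries.append(hit["_source"])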

How to make a pymongo aggregation that counts all elements and groups them in one request

I have a collection with fields like this:
{
"_id":"5cf54857bbc85fd0ff5640ba",
"book_id":"5cf172220fb516f706d00591",
"tags":{
"person":[
{"start_match":209, "length_match":6, "word":"kimmel"}
],
"organization":[
{"start_match":107, "length_match":12, "word":"philadelphia"},
{"start_match":209, "length_match":13, "word":"kimmel center"}
],
"location":[
{"start_match":107, "length_match":12, "word":"philadelphia"}
]
},
"deleted":false
}
I want to collect the different words in each category and count them.
So, the output should be like this:
{
"response": [
{
"tag": "location",
"tag_list": [
{
"count": 31,
"phrase": "philadelphia"
},
{
"count": 15,
"phrase": "usa"
}
]
},
{
"tag": "organization",
"tag_list": [ ... ]
},
{
"tag": "person",
"tag_list": [ ... ]
},
]
}
A pipeline like this works:
def pipeline_func(tag):
    return [
        {'$replaceRoot': {'newRoot': '$tags'}},
        {'$unwind': '${}'.format(tag)},
        {'$group': {'_id': '${}.word'.format(tag), 'count': {'$sum': 1}}},
        {'$project': {'phrase': '$_id', 'count': 1, '_id': 0}},
        {'$sort': {'count': -1}}
    ]
But it makes a request for each tag. I want to know how to do it in one request.
Thank you for your attention.
As noted, there is a slight mismatch between the question data and the claimed pipeline, since $unwind can only be used on arrays and tags as presented in the question is not an array.
For the data presented in the question you basically want a pipeline like this:
db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
Again, as per the note, since tags is in fact an object, what you actually need in order to collect data based on its sub-keys, as the question asks, is to essentially turn it into an array of items.
The usage of $replaceRoot in your current pipeline suggests that $objectToArray is fair to use here, as it is available from the later patch releases of MongoDB 3.4, which is the bare minimum version you should be running in production right now.
That $objectToArray does pretty much what the name says and produces an array (or "list", to be more pythonic) of entries broken into key and value pairs. These are essentially a "list" of objects (or "dict" entries) which have the keys k and v respectively. The output of the first pipeline stage would look like this for the supplied document:
{
"book_id": "5cf172220fb516f706d00591",
"tags": [
{
"k": "person",
"v": [
{
"start_match": 209,
"length_match": 6,
"word": "kimmel"
}
]
}, {
"k": "organization",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}, {
"start_match": 209,
"length_match": 13,
"word": "kimmel center"
}
]
}, {
"k": "location",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}
]
}
],
"deleted" : false
}
So you should be able to see how you can now easily access those k values and use them in grouping, and of course the v is a standard array as well. It is then just the two $unwind stages as shown, followed by two $group stages: the first $group collects over the combination of keys, and the second collects per the main grouping key whilst pushing the other accumulations into a "list" within that entry.
Of course the output of the above listing is not exactly what you asked for in the question, but the data is basically there. You can optionally add an $addFields or $project stage as the final aggregation stage to essentially rename the _id key:
{ "$addFields": {
"_id": "$$REMOVE",
"tag": "$_id"
}}
Or simply do something pythonic with a little list comprehension on the cursor output:
cursor = db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
output = [{ 'tag': doc['_id'], 'tag_list': doc['tag_list'] } for doc in cursor]
print({ 'response': output })
And the final output is a "list" you can use for the response:
{
"tag_list": [
{
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "location"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel"
}
],
"tag": "person"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel center"
}, {
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "organization"
}
Note that with the list comprehension approach you have a bit more control over the order of "keys" in the output, as MongoDB itself would simply append NEW key names in a projection, keeping existing keys ordered first. That is, if that sort of thing is important to you, though it really should not be, since Object/Dict-like structures should not be considered to have any set order of keys. That's what arrays (or lists) are for.

Iterating over elasticsearch response and creating aggregation function within one ES query

I have an elasticsearch query which returns the top 10 results for a given querystring. I now need to use the response to create a sum aggregation for each of the 10 top results. This is my query to return the top 10:
GET search/
{
"index": "my_index",
"query": {
"match": {
"name": {
"query": "hello world",
"fuzziness": 2
}
}
}
}
With the response from the above request, I generate a list of the 10 org_ids and iterate over each of these IDs. I then have to make another request using the query below (where "org_id": "12345" is the first element in my array of IDs).
POST my_index/_search
{ "size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"org_id": "12345"
}
}
]
}
},
"aggs": {
"aggregation_1": {
"sum": {
"field": "dollar_amount"
}
},
"aggregation_2": {
"sum": {
"field": "employees"
}
}
}
}
However, I think that this approach is inefficient because I have to make a total of 11 requests which won't scale well. Ideally, I would like to make one request that can do all of this.
Is there any functionality in ES that would make this possible, or would I have to make individual requests for each search parameter? I've looked through the docs and can't find anything that involves iterating over the array of results.
EDIT: For simplicity, I think having 2 requests is fine for now. So I just need to figure out how to pass through an array of org_ids into the 2nd query and do all aggregations in that 2nd query.
E.g.
POST my_index/_search
{ "size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"org_id": ["12345", "67891", "98765"]
}
}
]
}
},
"aggs": {
"aggregation_1": {
"sum": {
"field": "dollar_amount"
}
},
"aggregation_2": {
"sum": {
"field": "employees"
}
}
}
}
To start with, you can do all the aggregations in one step (so 2 requests in total).
I am taking a look at the fuzziness part, but I don't see how to make it a one-shot query.
Edit: are your org_id values unique (= the ids of the documents?), and can you describe your data (how the org_id values are linked with the fuzziness query)?
{ "size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"org_id": "12 13 14 15 16 17 18...."
}
}
]
}
},
"aggs": {
"group_org_id": {
"terms": {
"field": "org_id"
},
"aggs": {
"aggregation_1": {
"sum": {
"field": "dollar_amount"
}
},
"aggregation_2": {
"sum": {
"field": "employees"
}
}
}
}
}
}
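Put together in Python, the two requests might look roughly like this (a sketch only; it assumes the elasticsearch-py 8.x client and that org_id is an aggregatable keyword or numeric field):

from elasticsearch import Elasticsearch

# Sketch only: the endpoint is an assumption.
es = Elasticsearch("http://localhost:9200")

# Request 1: top 10 fuzzy matches on "name".
ids_resp = es.search(
    index="my_index",
    size=10,
    query={"match": {"name": {"query": "hello world", "fuzziness": 2}}},
)
org_ids = [hit["_source"]["org_id"] for hit in ids_resp["hits"]["hits"]]

# Request 2: one terms aggregation over all collected org_ids, with both sums per bucket.
sums_resp = es.search(
    index="my_index",
    size=0,
    query={"bool": {"must": [{"match": {"org_id": " ".join(str(i) for i in org_ids)}}]}},
    aggs={
        "group_org_id": {
            "terms": {"field": "org_id", "size": len(org_ids)},
            "aggs": {
                "aggregation_1": {"sum": {"field": "dollar_amount"}},
                "aggregation_2": {"sum": {"field": "employees"}},
            },
        }
    },
)

for bucket in sums_resp["aggregations"]["group_org_id"]["buckets"]:
    print(bucket["key"], bucket["aggregation_1"]["value"], bucket["aggregation_2"]["value"])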

Elastic Search nested object query

I have an elastic search index collection like below:
"_index":"test",
"_type":"abc",
"_source":{
"file_name":"xyz.ex"
"metadata":{
"format":".ex"
"profile":[
{"date_value" : "2018-05-30T00:00:00",
"key_id" : "1",
"type" : "date",
"value" : [ "30-05-2018" ]
},
{
"key_id" : "2",
"type" : "freetext",
"value" : [ "New york" ]
}
}
Now I need to search for documents by matching a key_id to its value (key_id is some field whose value is stored in "value").
Ex. For the key_id='1' field, if its value = "30-05-2018", it should match the above document.
I tried mapping this as a nested object, but I am not able to write a query that searches with 2 or more key_id values each matching their respective value.
This is how I would do it. You need to AND together, via bool/filter (or bool/must), one nested query per condition pair, since you want to match two different nested elements from the same parent document. In the query below, f1 and f2 stand in for your actual nested fields (e.g. key_id and value):
{
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "metadata.profile",
"query": {
"bool": {
"filter": [
{
"term": {
"metadata.profile.f1": "a"
}
},
{
"term": {
"metadata.profile.f2": true
}
}
]
}
}
}
},
{
"nested": {
"path": "metadata.profile",
"query": {
"bool": {
"filter": [
{
"term": {
"metadata.profile.f1": "b"
}
},
{
"term": {
"metadata.profile.f2": false
}
}
]
}
}
}
}
]
}
}
}
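Adapted to the fields from the question and expressed with the Python client, the same idea might look roughly like this (a sketch only; it assumes the elasticsearch-py 8.x client, that metadata.profile is mapped as nested, and that key_id and value are keyword fields):

from elasticsearch import Elasticsearch

# Sketch only: the endpoint is an assumption.
es = Elasticsearch("http://localhost:9200")

def profile_condition(key_id, value):
    # One nested query matching a single key_id/value pair inside metadata.profile.
    return {
        "nested": {
            "path": "metadata.profile",
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"metadata.profile.key_id": key_id}},
                        {"term": {"metadata.profile.value": value}},
                    ]
                }
            },
        }
    }

# Both conditions must hold on the same parent document, but each can match
# a different nested profile entry.
query = {
    "bool": {
        "filter": [
            profile_condition("1", "30-05-2018"),
            profile_condition("2", "New york"),
        ]
    }
}

resp = es.search(index="test", query=query)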
