I have an Elasticsearch database that I upload entries to like this:
{"s_food": "bread", "s_store": "Safeway", "s_date": "2020-06-30", "l_run": 28900, "l_covered": 1}
When I upload it to Elasticsearch, it adds _id, _type, @timestamp, and _index fields, so the stored entries look sort of like this:
{"s_food": "bread", "s_store": "Safeway", "s_date": "2020-06-30", "l_run": 28900, "l_covered": 1, "_type": "_doc", "_index": "my_index", "_id": "pe39u5hs874kee"}
The way I'm using the Elasticsearch database results in the same original entries getting uploaded multiple times. In this example, I only care about the s_food, s_date, and l_run fields forming a unique combination. Since I have so many entries, I'd like to use the Elasticsearch scroll tool to go through all the matches. So far, I've only seen people use aggregations to get buckets for each term and then iterate over each partition. I would like to use something like an aggregation to get one entire entry (just 1) for each unique combination of the three fields I care about (food, date, run). Right now I use an aggregation with a scroll like so:
GET /my-index/_search?scroll=25m
{
  "size": 10000,
  "aggs": {
    "foods": {
      "terms": {
        "field": "s_food"
      },
      "aggs": {
        "dates": {
          "terms": {
            "field": "s_date"
          },
          "aggs": {
            "runs": {
              "terms": {
                "field": "l_run"
              }
            }
          }
        }
      }
    }
  }
}
Unfortunately this is only giving me the usual bucketed structure that I don't want. Is there something else I should try?
All you need is to add a top_hits aggregation with size: 1 as the innermost sub-aggregation. You can read more about the top_hits aggregation in the Elasticsearch documentation.
The query would look like this:
{
  "size": 10000,
  "aggs": {
    "foods": {
      "terms": {
        "field": "s_food"
      },
      "aggs": {
        "dates": {
          "terms": {
            "field": "s_date"
          },
          "aggs": {
            "runs": {
              "terms": {
                "field": "l_run"
              },
              "aggs": {
                "topOne": {
                  "top_hits": {
                    "size": 1
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
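If you then want to walk the response from Python, here is a minimal sketch (the requests usage, URL, and terms sizes are assumptions, not part of the original answer; note that a terms aggregation returns only its top 10 buckets per level by default, and that scroll paginates hits rather than aggregation buckets, so the bucket sizes are raised explicitly instead):

import requests

# Sketch: run the top_hits query and flatten the nested buckets back
# into one document per unique (s_food, s_date, l_run) combination.
body = {
    "size": 0,  # only the aggregation buckets are needed, not raw hits
    "aggs": {
        "foods": {
            "terms": {"field": "s_food", "size": 1000},  # sizes are placeholders to tune
            "aggs": {
                "dates": {
                    "terms": {"field": "s_date", "size": 1000},
                    "aggs": {
                        "runs": {
                            "terms": {"field": "l_run", "size": 1000},
                            "aggs": {"topOne": {"top_hits": {"size": 1}}},
                        }
                    },
                }
            },
        }
    },
}

resp = requests.post("http://localhost:9200/my-index/_search", json=body).json()

unique_docs = []
for food in resp["aggregations"]["foods"]["buckets"]:
    for date in food["dates"]["buckets"]:
        for run in date["runs"]["buckets"]:
            unique_docs.append(run["topOne"]["hits"]["hits"][0]["_source"])

print(len(unique_docs), "unique (s_food, s_date, l_run) combinations")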
I would like to query a value across all the documents I have in Elasticsearch.
For example, I have this document:
{
  "website": "google",
  "color": [
    {
      "color1": "red",
      "color2": "blue"
    }
  ]
}
I have this document for an undefined number of websites. I want to extract all the "color1" values for all the websites I have. How can I do it? I tried with match_all and "size": 0 but it didn't work.
Thanks a lot!
To be able to query nested objects, you first need to map them as a nested field; then you can aggregate on the nested field like this:
GET /my-index-000001/_search
{
  "aggs": {
    "test": {
      "nested": {
        "path": "color"
      },
      "aggs": {
        "test2": {
          "terms": {
            "field": "color.color1"
          }
        }
      }
    }
  }
}
The result for this query should look like this:
"aggregations": {
  "test": {
    "doc_count": 5,
    "test2": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "red",
          "doc_count": 4
        },
        {
          "key": "gray",
          "doc_count": 1
        }
      ]
    }
  }
}
If you check the aggregation result, you will have the list of color1 values along with the number of times each appeared in your documents.
For more information, you can check the official Elasticsearch documentation on the nested field type and the nested aggregation.
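For reference, here is a sketch of the prerequisite nested mapping, created from Python (the requests usage, index name, and field types are assumptions):

import requests

# Hypothetical index creation that marks "color" as nested so the
# aggregation above can target color.color1.
mapping = {
    "mappings": {
        "properties": {
            "website": {"type": "keyword"},
            "color": {
                "type": "nested",
                "properties": {
                    "color1": {"type": "keyword"},
                    "color2": {"type": "keyword"}
                }
            }
        }
    }
}
print(requests.put("http://localhost:9200/my-index-000001", json=mapping).json())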
I have an elasticsearch query which returns the top 10 results for a given querystring. I now need to use the response to create a sum aggregation for each of the 10 top results. This is my query to return the top 10:
GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "hello world",
        "fuzziness": 2
      }
    }
  }
}
With the response from the above request, I generate a list of the 10 org_ids and iterate over each of these IDs. I have to make another request using the query below (where "org_id": "12345" is the first element in my array of IDs).
POST my_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "org_id": "12345"
          }
        }
      ]
    }
  },
  "aggs": {
    "aggregation_1": {
      "sum": {
        "field": "dollar_amount"
      }
    },
    "aggregation_2": {
      "sum": {
        "field": "employees"
      }
    }
  }
}
However, I think that this approach is inefficient because I have to make a total of 11 requests which won't scale well. Ideally, I would like to make one request that can do all of this.
Is there any functionality in ES that would make this possible, or would I have to make individual requests for each search parameter? I've looked through the docs and can't find anything that involves iterating over the array of results.
EDIT: For simplicity, I think having 2 requests is fine for now. So I just need to figure out how to pass through an array of org_ids into the 2nd query and do all aggregations in that 2nd query.
E.g.
POST my_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "org_id": ["12345", "67891", "98765"]
          }
        }
      ]
    }
  },
  "aggs": {
    "aggregation_1": {
      "sum": {
        "field": "dollar_amount"
      }
    },
    "aggregation_2": {
      "sum": {
        "field": "employees"
      }
    }
  }
}
To start, you can do the aggregation in one step (so two requests in total).
I am taking a look at the fuzziness part, but I don't see how to make it a one-shot query.
Edit: are your org_ids unique (i.e. the ids of documents)? Can you describe your data (how are the org_ids linked to the fuzziness query)?
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "org_id": "12 13 14 15 16 17 18...."
          }
        }
      ]
    }
  },
  "aggs": {
    "group_org_id": {
      "terms": {
        "field": "org_id"
      },
      "aggs": {
        "aggregation_1": {
          "sum": {
            "field": "dollar_amount"
          }
        },
        "aggregation_2": {
          "sum": {
            "field": "employees"
          }
        }
      }
    }
  }
}
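For the two-request flow itself, here is a rough Python sketch (the requests usage and URL are assumptions; it also swaps the match for a terms query, which accepts a list of exact values and so fits passing an array of org_ids, assuming org_id is a keyword field):

import requests

SEARCH_URL = "http://localhost:9200/my_index/_search"

# Request 1: fuzzy match on "name"; the top 10 hits come back by default.
first = requests.post(SEARCH_URL, json={
    "query": {"match": {"name": {"query": "hello world", "fuzziness": 2}}}
}).json()
org_ids = [hit["_source"]["org_id"] for hit in first["hits"]["hits"]]

# Request 2: restrict to those ids, then sum per org_id bucket.
second = requests.post(SEARCH_URL, json={
    "size": 0,
    "query": {"terms": {"org_id": org_ids}},
    "aggs": {
        "group_org_id": {
            "terms": {"field": "org_id"},
            "aggs": {
                "aggregation_1": {"sum": {"field": "dollar_amount"}},
                "aggregation_2": {"sum": {"field": "employees"}},
            },
        }
    },
}).json()

for bucket in second["aggregations"]["group_org_id"]["buckets"]:
    print(bucket["key"],
          bucket["aggregation_1"]["value"],
          bucket["aggregation_2"]["value"])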
I'm curious about the best approach to count the instances of a particular field, across all documents, in a given ElasticSearch index.
For example, if I've got the following documents in index goober:
{
  "_id": "foo",
  "field1": "a value",
  "field2": "a value"
},
{
  "_id": "bar",
  "field1": "a value",
  "field2": "a value"
},
{
  "_id": "baz",
  "field1": "a value",
  "field3": "a value"
}
I'd like to know something like the following:
{
  "index": "goober",
  "field_counts": {
    "field1": 3,
    "field2": 2,
    "field3": 1
  }
}
Is this doable with a single query, or multiple? For what it's worth, I'm using the Python elasticsearch and elasticsearch-dsl clients.
I've successfully issued a GET request to /goober and retrieved the mappings, and am learning how to submit requests for aggregations for each field, but I'm interested in learning how many times a particular field appears across all documents.
Coming from using Solr, still getting my bearings with ES. Thanks in advance for any suggestions.
The query below will return the count of docs that contain "field2" (the count comes back in hits.total of the response):
POST /INDEX/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "exists": {
          "field": "field2"
        }
      }
    }
  }
}
And here is an example using multiple aggregations (each agg comes back as a bucket with a doc count), using exists filters:
POST /INDEX/_search
{
  "size": 0,
  "aggs": {
    "field_has1": {
      "filter": {
        "exists": {
          "field": "field1"
        }
      }
    },
    "field_has2": {
      "filter": {
        "exists": {
          "field": "field2"
        }
      }
    }
  }
}
The behavior within each agg in the second example mimics the behavior of the first query. In many cases, you can take a regular search query and nest those lookups within aggregation buckets.
Quick time-saver based on existing answer:
import requests

interesting_fields = ['field1', 'field2']
body = {
    'size': 0,
    'aggs': {
        f'has_{field_name}': {
            'filter': {
                'exists': {
                    'field': field_name
                }
            }
        }
        for field_name in interesting_fields
    },
}
print(requests.post('http://localhost:9200/INDEX/_search', json=body).json())
I want to iterate over a MongoDB array field (the TRANSACTION list) and remove a specific item from it using pymongo.
I create the Mongo collection using pymongo as shown below. How can I iterate over the TRANSACTION array and remove only its final item?
Data insert query using Python pymongo:
# added new method to create the block chain structure
def addCoinWiseTransaction(self, senz, coin, format_date):
    self.collection = self.db.block_chain
    coinValexists = self.collection.count_documents({"_id": str(coin)})
    print('coin exists : ', coinValexists)
    if coinValexists > 0:
        print('coin hash exists')
        # push a new transaction onto the existing TRANSACTION array
        newTransaction = {"$push": {"TRANSACTION": {"SENDER": senz.attributes["#SENDER"],
                                                    "RECIVER": senz.attributes["#RECIVER"],
                                                    "T_NO_COIN": int(1),
                                                    "DATE": datetime.datetime.utcnow()
                                                    }}}
        self.collection.update_one({"_id": str(coin)}, newTransaction)
    else:
        flag = senz.attributes["#f"]
        print(flag)
        if flag == "ccb":
            print('new coin mined by another miner')
            root = {"_id": str(coin),
                    "S_ID": int(senz.attributes["#S_ID"]), "S_PARA": senz.attributes["#S_PARA"],
                    "FORMAT_DATE": format_date,
                    "NO_COIN": int(1),
                    "TRANSACTION": [{"MINER": senz.attributes["#M_S_ID"],
                                     "RECIVER": senz.attributes["#RECIVER"],
                                     "T_NO_COIN": int(1),
                                     "DATE": datetime.datetime.utcnow()
                                     }]
                    }
            self.collection.insert_one(root)
        else:
            print('new coin mined')
            root = {"_id": str(coin),
                    "S_ID": int(senz.attributes["#S_ID"]), "S_PARA": senz.attributes["#S_PARA"],
                    "FORMAT_DATE": format_date,
                    "NO_COIN": int(1),
                    "TRANSACTION": [{"MINER": "M_1",
                                     "RECIVER": senz.sender,
                                     "T_NO_COIN": int(1),
                                     "DATE": datetime.datetime.utcnow()
                                     }]
                    }
            self.collection.insert_one(root)
    return 'DONE'
To remove the last entry, the general idea (as you have mentioned) is to iterate the array and grab the index of the last element as denoted by its DATE field, then update the collection by removing it using $pull. So the crucial piece of data you need for this to work is the DATE value and the document's _id.
One approach you could take is to first use the aggregation framework to get this data. With this, you can run a pipeline where the first step is filtering the documents in the collection using the $match operator, which accepts standard MongoDB queries.
The next stage after filtering the documents is to flatten the TRANSACTION array, i.e. denormalise the documents in the list so that you can filter out the final item, i.e. get the last document by the DATE field. This is made possible with the $unwind operator, which, for each input document, outputs n documents, where n is the number of array elements (and can be zero for an empty array).
After deconstructing the array, in order to get the last document, use the $group operator to regroup the flattened documents and, in the process, use the group accumulator operators to obtain the last TRANSACTION date by applying the $max operator to its embedded DATE field.
So in essence, run the following pipeline and use the results to update the collection:
mongo shell
db.block_chain.aggregate([
{ "$match": { "_id": coin_id } },
{ "$unwind": "$TRANSACTION" },
{
"$group": {
"_id": "$_id",
"last_transaction_date": { "$max": "$TRANSACTION.DATE" }
}
}
])
You can then take the document returned by this aggregate operation, via the toArray() method or the aggregation cursor, and use it to update your collection:
var docs = db.block_chain.aggregate([
{ "$match": { "_id": coin_id } },
{ "$unwind": "$TRANSACTION" },
{
"$group": {
"_id": "$_id",
"LAST_TRANSACTION_DATE": { "$max": "$TRANSACTION.DATE" }
}
}
]).toArray()
db.block_chain.updateOne(
{ "_id": docs[0]._id },
{
"$pull": {
"TRANSACTION": {
"DATE": docs[0]["LAST_TRANSACTION_DATE"]
}
}
}
)
python
def remove_last_transaction(self, coin):
    self.collection = self.db.block_chain
    pipe = [
        { "$match": { "_id": str(coin) } },
        { "$unwind": "$TRANSACTION" },
        {
            "$group": {
                "_id": "$_id",
                "last_transaction_date": { "$max": "$TRANSACTION.DATE" }
            }
        }
    ]
    # run aggregate pipeline
    docs = list(self.collection.aggregate(pipeline=pipe))
    # run update, pulling the array element whose DATE matches
    self.collection.update_one(
        { "_id": docs[0]["_id"] },
        {
            "$pull": {
                "TRANSACTION": {
                    "DATE": docs[0]["last_transaction_date"]
                }
            }
        }
    )
Alternatively, you can run a single aggregate operation that also updates your collection, using the $out pipeline stage, which writes the results of the pipeline back to the same collection:

If the collection specified by the $out operation already exists, then upon completion of the aggregation, the $out stage atomically replaces the existing collection with the new results collection. The $out operation does not change any indexes that existed on the previous collection. If the aggregation fails, the $out operation makes no changes to the pre-existing collection.
For example, you could run this pipeline:
mongo shell
db.block_chain.aggregate([
    { "$match": { "_id": coin_id } },
    { "$unwind": "$TRANSACTION" },
    { "$sort": { "TRANSACTION.DATE": 1 } },
    {
        "$group": {
            "_id": "$_id",
            "LAST_TRANSACTION": { "$last": "$TRANSACTION" },
            "FORMAT_DATE": { "$first": "$FORMAT_DATE" },
            "NO_COIN": { "$first": "$NO_COIN" },
            "S_ID": { "$first": "$S_ID" },
            "S_PARA": { "$first": "$S_PARA" },
            "TRANSACTION": { "$push": "$TRANSACTION" }
        }
    },
    {
        "$project": {
            "FORMAT_DATE": 1,
            "NO_COIN": 1,
            "S_ID": 1,
            "S_PARA": 1,
            "TRANSACTION": {
                "$setDifference": ["$TRANSACTION", ["$LAST_TRANSACTION"]]
            }
        }
    },
    { "$out": "block_chain" }
])
python
def remove_last_transaction(self, coin):
    self.db.block_chain.aggregate([
        { "$match": { "_id": str(coin) } },
        { "$unwind": "$TRANSACTION" },
        { "$sort": { "TRANSACTION.DATE": 1 } },
        {
            "$group": {
                "_id": "$_id",
                "LAST_TRANSACTION": { "$last": "$TRANSACTION" },
                "FORMAT_DATE": { "$first": "$FORMAT_DATE" },
                "NO_COIN": { "$first": "$NO_COIN" },
                "S_ID": { "$first": "$S_ID" },
                "S_PARA": { "$first": "$S_PARA" },
                "TRANSACTION": { "$push": "$TRANSACTION" }
            }
        },
        {
            "$project": {
                "FORMAT_DATE": 1,
                "NO_COIN": 1,
                "S_ID": 1,
                "S_PARA": 1,
                "TRANSACTION": {
                    "$setDifference": ["$TRANSACTION", ["$LAST_TRANSACTION"]]
                }
            }
        },
        { "$out": "block_chain" }
    ])
Whilst this approach can be more efficient than the first, it requires knowledge of the existing fields up front, so in some cases it may not be practical. Note also that, per the quoted documentation, $out replaces the whole target collection with the pipeline output, so with a $match on a single _id this variant is only safe if the collection holds just that one document.
With PyMongo, grouping by one key seems to be OK:
results = collection.group(key={"scan_status":0}, condition={'date': {'$gte': startdate}}, initial={"count": 0}, reduce=reducer)
results:
{u'count': 215339.0, u'scan_status': u'PENDING'}
{u'count': 617263.0, u'scan_status': u'DONE'}
but when I try to group by multiple keys I get an exception:
results = collection.group(key={"scan_status":0,"date":0}, condition={'date': {'$gte': startdate}}, initial={"count": 0}, reduce=reducer)
How can I group by multiple fields correctly?
If you are trying to count over two keys then, while it is possible using .group(), your better option is .aggregate().
This uses native code operators, rather than the interpreted JavaScript required by .group(), to do the same basic "grouping" action you are trying to achieve.
In particular, here is the $group pipeline stage:
result = collection.aggregate([
    # Match the possible documents
    { "$match": { "date": { "$gte": startdate } } },
    # Group the documents and "count" via $sum on the values
    { "$group": {
        "_id": {
            "scan_status": "$scan_status",
            "date": "$date"
        },
        "count": { "$sum": 1 }
    }}
])
In fact you probably want something that reduces the "date" into a distinct period. As in:
result = collection.aggregate([
    # Match the possible documents
    { "$match": { "date": { "$gte": startdate } } },
    # Group the documents and "count" via $sum on the values
    { "$group": {
        "_id": {
            "scan_status": "$scan_status",
            "date": {
                "year": { "$year": "$date" },
                "month": { "$month": "$date" },
                "day": { "$dayOfMonth": "$date" }
            }
        },
        "count": { "$sum": 1 }
    }}
])
This uses the date aggregation operators, as documented in the MongoDB manual.
Or perhaps with basic "date math":
import datetime

result = collection.aggregate([
    # Match the possible documents
    { "$match": { "date": { "$gte": startdate } } },
    # Group the documents and "count" via $sum on the values,
    # using "epoch" 1970-01-01 as a base to convert the date to an integer.
    # (datetime.datetime is used because PyMongo cannot encode datetime.date)
    { "$group": {
        "_id": {
            "scan_status": "$scan_status",
            "date": {
                "$subtract": [
                    { "$subtract": [ "$date", datetime.datetime.utcfromtimestamp(0) ] },
                    { "$mod": [
                        { "$subtract": [ "$date", datetime.datetime.utcfromtimestamp(0) ] },
                        1000 * 60 * 60 * 24
                    ]}
                ]
            }
        },
        "count": { "$sum": 1 }
    }}
])
This will return integer values from "epoch" time instead of a composite value object.
But all of these options are better than .group(), as they use natively coded routines that perform much faster than the JavaScript you would otherwise need to supply.
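Whichever variant you choose, aggregate() returns a cursor of plain documents that you can iterate directly; for example, with the first pipeline above (collection and startdate as already defined):

# Each output doc carries the composite _id plus the count.
for doc in collection.aggregate([
    { "$match": { "date": { "$gte": startdate } } },
    { "$group": {
        "_id": { "scan_status": "$scan_status", "date": "$date" },
        "count": { "$sum": 1 }
    }}
]):
    print(doc["_id"]["scan_status"], doc["_id"]["date"], doc["count"])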