Sequential Searching Across Multiple Indexes In Elasticsearch - python

Suppose I have Elasticsearch indexes in the following order:
index-2022-04
index-2022-05
index-2022-06
...
index-2022-04 represents the data stored in the month of April 2022, index-2022-05 represents the data stored in the month of May 2022, and so on. Now let's say in my query payload, I have the following timestamp range:
"range": {
"timestampRange": {
"gte": "2022-04-05T01:00:00.708363",
"lte": "2022-06-06T23:00:00.373772"
}
}
The above range states that I want to query the data that exists between the 5th of April and the 6th of June. That means I have to query the data inside three indexes: index-2022-04, index-2022-05 and index-2022-06. Is there a simple and efficient way of performing this query across those three indexes without having to query each index one-by-one?
I am using Python to handle the query, and I am aware that I can query across different indexes at the same time (see this SO post). Any tips or pointers would be helpful, thanks.

You simply need to define an alias over your indices and query the alias instead of the individual indexes, letting ES figure out which underlying indexes it needs to visit.
Optionally, for increased search performance, you can also configure index-time sorting on timestampRange, so that if your alias spans a full year of indexes, ES knows to visit only three of them based on the range constraint in your query (2022-04-05 -> 2022-06-06).
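As a rough sketch of that approach (the alias name index-all is made up here, and the calls assume the same elasticsearch Python client used in the question), the alias can be created once over the monthly indexes and then used as the search target:

from elasticsearch import Elasticsearch

es_client = Elasticsearch('https://elastic.host:9200')

# Point a single alias (hypothetical name: index-all) at the existing monthly indexes.
es_client.indices.put_alias(index='index-2022-*', name='index-all')

# Query the alias; Elasticsearch resolves the underlying indexes itself.
result = es_client.search(
    index='index-all',
    query={
        "range": {
            "timestampRange": {
                "gte": "2022-04-05T01:00:00.708363",
                "lte": "2022-06-06T23:00:00.373772"
            }
        }
    }
)

Note that an alias created from a wildcard like this only covers the indexes that exist at creation time; newly created monthly indexes would need to be added to the alias as well (for example via an index template).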

Like you wrote, you can simply use a wildcard index pattern and/or pass a list as the target index.
The simplest way would be to just query all of your indices with an asterisk wildcard (e.g. index-* or index-2022-*) as the target. You do not need to define an alias for that; you can just use the wildcard in the target string, like so:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('https://elastic.host:9200')

datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'

result = es_client.search(
    index='index-*',
    query={
        "bool": {
            "must": [{
                "range": {
                    "timestampRange": {
                        "gte": datestring_start,
                        "lte": datestring_end
                    }
                }
            }]
        }
    }
)
This will query all indices that match the pattern, but I would expect Elasticsearch to perform some sort of optimization on this. As @Val wrote in his answer, configuring index-time sorting will be beneficial for performance, as it limits the number of documents that need to be visited when the index sort and the search sort are the same.
For completeness' sake, if you really wanted to pass just the relevant index names to Elasticsearch, another option would be to first figure out on the Python side which sequence of indices you need to query and supply these as a list of index names (e.g. ['index-2022-04', 'index-2022-05', 'index-2022-06']) as the target. You could e.g. use the Pandas date_range() function to easily generate such a list of indices, like so:
from elasticsearch import Elasticsearch
import pandas as pd

es_client = Elasticsearch('https://elastic.host:9200')

datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'

# Floor the start date to the beginning of its month, generate one month-start
# timestamp per month in the range, and format each as an index name.
months_list = (
    pd.date_range(
        pd.to_datetime(datestring_start).to_period('M').to_timestamp(),
        datestring_end,
        freq='MS'
    )
    .strftime("index-%Y-%m")
    .tolist()
)

result = es_client.search(
    index=months_list,
    query={
        "bool": {
            "must": [{
                "range": {
                    "timestampRange": {
                        "gte": datestring_start,
                        "lte": datestring_end
                    }
                }
            }]
        }
    }
)
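One caveat with passing explicit index names (continuing from the snippet above): if one of the generated monthly indexes does not exist yet, the search will fail with an index_not_found_exception. If that can happen in your setup, the search API's ignore_unavailable flag tells Elasticsearch to skip missing indexes, for example:

result = es_client.search(
    index=months_list,
    ignore_unavailable=True,  # skip generated index names that do not exist
    query={
        "range": {
            "timestampRange": {
                "gte": datestring_start,
                "lte": datestring_end
            }
        }
    }
)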

Related

How to manipulate and slice multi-dimensional JSON data in Python?

I'm trying to set up a convenient system for storing and analyzing data from experiments. For the data files I use the following JSON format:
{
    "sample_id": "",
    "timestamp": "",
    "other_metadata1": "",
    "measurements": {
        "type1": {
            "timestamp": "",
            "other_metadata2": "",
            "data": {
                "parameter1": [1, 2, 3],
                "parameter2": [4, 5, 6]
            }
        },
        "type2": { ... }
    }
}
Now for analyzing many of these files, I want to filter for sample metadata and measurement metadata to get a subset of the data to plot. I wrote a function like this:
import copy

def get_subset(data_dict, include_samples={}, include_measurements={}):
    # Start with a copy of all datasets
    subset = copy.deepcopy(data_dict)
    # Include samples if they satisfy certain properties
    for prop, req in include_samples.items():
        subset = {file: sample for file, sample in subset.items() if sample[prop] == req}
    # Filter by measurement properties
    for file, sample in subset.items():
        measurements = sample['measurements'].copy()
        for prop, req in include_measurements.items():
            measurements = [meas for meas in measurements if meas[prop] == req]
        # Replace the measurements list
        sample['measurements'] = measurements
    return subset
While this works, I feel like I'm re-inventing the wheel of something like pandas. I would like to have more functionality like dropping all NaN values, excluding based on metadata, etc., all of which is available in pandas. However, my data format is not compatible with the 2D nature of a DataFrame.
Any suggestions on how to go about manipulating and slicing such data structures without reinventing a lot of things?
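For illustration only (this is not from the thread, and it assumes the JSON layout shown above), one way to make such files pandas-friendly is to flatten each file into long-format rows first, after which the usual pandas filtering and NaN handling apply:

import pandas as pd

def flatten_sample(sample):
    # Turn one nested sample dict into flat long-format rows so that the
    # result can be filtered and sliced with ordinary pandas operations.
    rows = []
    for mtype, meas in sample['measurements'].items():
        for param, values in meas['data'].items():
            for value in values:
                rows.append({
                    'sample_id': sample['sample_id'],
                    'measurement_type': mtype,
                    'parameter': param,
                    'value': value,
                })
    return rows

# 'samples' is a hypothetical list of loaded JSON dicts in the format above:
# df = pd.DataFrame([row for sample in samples for row in flatten_sample(sample)])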

Google Sheets Python batchUpdate repeatCell -> issue with range and number format

I am trying to use the Google Sheets API for Python to format only a specific column's results to a "NUMBER" type, but am struggling to get it to work properly. Am I doing something wrong with the "range" block? There are values that get appended to the column (via a different API call), and when they get appended they do not come back as formatted numbers that, when highlighting the entire column, result in a numeric sum.
id_sampleforstackoverflow = 'abcdefg123xidjadsfh192810'

cost_sav_body = {
    "requests": [
        {
            "repeatCell": {
                "range": {
                    "sheetId": 0,
                    "startRowIndex": 2,
                    "endRowIndex": 6,
                    "startColumnIndex": 0,
                    "endColumnIndex": 6
                },
                "cell": {
                    "userEnteredFormat": {
                        "numberFormat": {
                            "type": "NUMBER",
                            "pattern": "#.0#;#.0#"
                        }
                    }
                },
                "fields": "userEnteredFormat.numberFormat"
            }
        }
    ]
}
cost_sav_sum = service.spreadsheets().batchUpdate(spreadsheetId=id_sampleforstackoverflow, body=cost_sav_body).execute()
So when I run the above with the rest of my code, the values get appended; however, when highlighting the column, it simply gives me a count of the objects and not a formatted number summing the total of the values (i.e. there are three values of -24, but I only see a "Count" of 3 instead of -72).
I am using the GCP recommendations API for machineType to append the cost projection -> cost -> units value to the column (the values append, for example, as -24).
Can someone help?
Documentation I have already gone through:
https://cloud.google.com/blog/products/application-development/formatting-cells-with-the-google-sheets-api
https://developers.google.com/sheets/api/guides/formats
https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#GridRange
@all
I was able to figure out the problem. When doing straight reporting of the cost values (as explained above as the objective), I was converting the output to a string using the str() Python method. I removed that str() call and kept the rest of the code you see above, and now things are posting correctly:
#spend = str(element.primary_impact.cost_projection.cost.units)
spend = element.primary_impact.cost_projection.cost.units
So FYI for anyone else wondering: make sure the str() method is not used if you need to apply a custom number format to those particular cells!
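As a small side note (not from the original answer): if the value does arrive as a string from elsewhere, casting it back to a number in Python before appending should have the same effect, for example:

spend = int(element.primary_impact.cost_projection.cost.units)  # cast shown for illustration; assumes the value is integral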

Getting max value of an array in record - Pymongo

I have a Mongo collection which stores prices for a number of cryptocurrencies.
Each record looks like this:
1. ticker: string
2. dates: array of datetime.datetime
3. prices: array of float
I am trying to get the most recent date for a given ticker. I've tried this Python code:
max_date = date_price_collection.find({"ticker": "BTC"},{"dates":1}).limit(1).sort({"dates",pymongo.ASCENDING})
But it gives the error:
TypeError: if no direction is specified, key_or_list must be an instance of list
How can I get the maximum date for a specific filtered record?
UPDATE:
The following works:
max_date = date_price_collection.aggregate([ {'$match': {"ticker": ticker}},{'$project': {'max':{'$max': '$dates'}}} ])
The following query will return the max date value from the dates array. Note that this syntax (and feature) is available with MongoDB v4.4 and higher only; it allows the use of aggregation projection (and its operators) in find queries.
max_date = date_price_collection.find( { "ticker": "BTC" }, { "_id": False, "dates": { "$max": "$dates" } } )
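For reference (not part of the answers above), the TypeError in the question comes from passing a set to sort(): PyMongo's sort() expects a key name plus a direction, or a list of (key, direction) pairs. For example:

import pymongo

# list-of-tuples form; this fixes the TypeError, although it sorts whole
# documents rather than returning the max element of the dates array,
# so the $max approaches above are still the way to get the actual value
cursor = date_price_collection.find({"ticker": "BTC"}, {"dates": 1}).sort([("dates", pymongo.DESCENDING)]).limit(1)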

Elasticsearch: match only datefields that are None

I'm trying to return a list of unit id's where the date field is None.
The example below is just a snippet. A company can have several hundred unit id's, but I only want to return a list of active units (where 'validUntil' is None).
'_source': {'company': {'companyId': 1,
                        'unit': [{'unitId': 1,
                                  'period': {'validUntil': '2016-02-07'}},
                                 {'unitId': 2,
                                  'period': {'validUntil': None}}]}}
payload = {
    "size": 200,
    "_source": "company.companyId.unitId",
    "query": {
        "term": {
            "company.companyId": "1"
        }
    }
}
I have tried several different things (filter, must_not exists etc.), but either the searches return all unit id's pertaining to that company id or nothing, making me suspect that I'm not filtering correctly.
The date format is 'dateOptionalTime' if that is any help.
It looks like your problem might not be in the query itself.
As far as I know, you cannot return only part of the array if its type is not nested.
I recommend looking at this question:
select matching objects from array in elasticsearch
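For illustration (this is not from the linked answer, and it assumes company.unit is mapped as a nested field, which the question does not confirm), a query along these lines would match only units whose period.validUntil is missing and return just those entries via inner_hits:

payload = {
    "size": 200,
    "_source": False,
    "query": {
        "bool": {
            "must": [
                {"term": {"company.companyId": "1"}},
                {
                    "nested": {
                        "path": "company.unit",
                        "query": {
                            "bool": {
                                "must_not": [
                                    {"exists": {"field": "company.unit.period.validUntil"}}
                                ]
                            }
                        },
                        # inner_hits returns only the matching unit entries
                        "inner_hits": {}
                    }
                }
            ]
        }
    }
}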

Using json_normalize for structured multi level dictionaries with lists

I've successfully transferred the data from a JSON file (structured as per the below example), into a three column ['tag', 'time', 'score'] DataFrame using the following iterative approach:
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        df.loc[len(df)] = [v['tag_id'], v1['time'], v1['value']]
However, while this ultimately achieves the desired result, it takes a huge amount of time to iterate through larger files with the same structure. I'm assuming that an iterative approach is not the ideal way to tackle this sort of problem. Using pandas.io.json.json_normalize instead, I've tried the following:
result = json_normalize(my_request, ['content'], ['data', 'score', ['time', 'value']])
Which returns KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('data',)). I believe I've misinterpreted the pandas documentation on json_normalize, and can't quite figure out how I should pass the parameters.
Can anyone point me in the right direction?
(alternatively using errors='ignore' returns ValueError: Conflicting metadata name data, need distinguishing prefix.)
JSON Structure
{
    'content': [
        {
            'data': {
                'score': [
                    {
                        'time': '2015-03-01 00:00:30',
                        'value': 75.0
                    },
                    {
                        'time': '2015-03-01 23:50:30',
                        'value': 58.0
                    }
                ]
            },
            'tag_id': 320676
        },
        {
            'data': {
                'score': [
                    {
                        'time': '2015-03-01 00:00:25',
                        'value': 78.0
                    },
                    {
                        'time': '2015-03-01 00:05:25',
                        'value': 57.0
                    }
                ]
            },
            'tag_id': 320677
        }
    ],
    'meta': None,
    'requested': '2018-04-15 13:00:00'
}
However, while this ultimately achieves the desired result, it takes a huge amount of time to iterate through larger files with the same structure.
I would suggest the following:
1. Check whether the problem is with your iterated appends. Pandas is not very good at sequentially adding rows. How about this code:
import pandas as pd

tups = []
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        # append a single tuple per row rather than three separate arguments
        tups.append((v['tag_id'], v1['time'], v1['value']))

df = pd.DataFrame(tups, columns=['tag_id', 'time', 'value'])
2. If the preceding is not fast enough, check if it's the JSON-parsing part with:
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        v['tag_id'], v1['time'], v1['value']
It is probable that 1. will be fast enough. If not, however, check if ujson might be faster for this case.
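As an additional sketch (not part of the original answer), json_normalize itself can flatten the structure in the question when given a record_path and meta; the call below uses the top-level pd.json_normalize of recent pandas versions and assumes the JSON layout shown above:

import pandas as pd

# one row per score entry, with tag_id repeated from the parent record
result = pd.json_normalize(
    my_request['content'],
    record_path=['data', 'score'],
    meta=['tag_id']
)
# resulting columns: time, value, tag_id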
