Elasticsearch: match only date fields that are None - python

I'm trying to return a list of unit IDs where the date field is None.
The example below is just a snippet. A company can have several hundred unit IDs, but I only want to return a list of active units (where 'validUntil' is None).
'_source': {'company': {'companyId': 1,
                        'unit': [{'unitId': 1,
                                  'period': {'validUntil': '2016-02-07'}},
                                 {'unitId': 2,
                                  'period': {'validUntil': None}}]}}
payload = {
    "size": 200,
    "_source": "company.companyId.unitId",
    "query": {
        "term": {
            "company.companyId": "1"
        }
    }
}
I have tried several different things (filter, must_not exists, etc.), but the searches either return all unit IDs pertaining to that company ID or nothing at all, making me suspect that I'm not filtering correctly.
The date format is 'dateOptionalTime' if that is any help.

It looks like your problem might not be in the query itself.
As far as I know, you cannot return only part of the array if its type is not nested.
I recommend looking at this question:
select matching objects from array in elasticsearch
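If you remap company.unit as a nested type, something along these lines might work; this is only a sketch under that assumption, with the field paths taken from the snippet above (inner_hits restricts the response to the units that actually match):
# Sketch only: assumes company.unit is mapped as "nested" in the index mapping.
payload = {
    "size": 200,
    "_source": False,
    "query": {
        "bool": {
            "filter": [
                {"term": {"company.companyId": 1}},
                {"nested": {
                    "path": "company.unit",
                    "query": {
                        "bool": {
                            "must_not": {
                                "exists": {"field": "company.unit.period.validUntil"}
                            }
                        }
                    },
                    # return only the matching (active) units instead of the whole array
                    "inner_hits": {"_source": ["company.unit.unitId"]}
                }}
            ]
        }
    }
}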

Related

Sequential Searching Across Multiple Indexes In Elasticsearch

Suppose I have Elasticsearch indexes in the following order:
index-2022-04
index-2022-05
index-2022-06
...
index-2022-04 represents the data stored in the month of April 2022, index-2022-05 represents the data stored in the month of May 2022, and so on. Now let's say in my query payload, I have the following timestamp range:
"range": {
"timestampRange": {
"gte": "2022-04-05T01:00:00.708363",
"lte": "2022-06-06T23:00:00.373772"
}
}
The above range states that I want to query the data that exists between the 5th of April and the 6th of June. That would mean that I have to query the data inside three indexes: index-2022-04, index-2022-05 and index-2022-06. Is there a simple and efficient way of performing this query across those three indexes without having to query each index one by one?
I am using Python to handle the query, and I am aware that I can query across different indexes at the same time (see this SO post). Any tips or pointers would be helpful, thanks.
You simply need to define an alias over your indices and query the alias instead of the individual indexes, letting ES figure out which underlying indexes it needs to visit.
Additionally, for increased search performance, you can also configure index-time sorting on timestampRange, so that if your alias spans a full year of indexes, ES knows to visit only three of them based on the range constraint in your query (2022-04-05 -> 2022-06-06).
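As a rough sketch (the alias name 'logs-all' and the index pattern are just examples), creating and querying such an alias with the Python client could look like this:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('https://elastic.host:9200')

# Attach one alias to every monthly index; queries then target the alias.
es_client.indices.put_alias(index='index-2022-*', name='logs-all')

result = es_client.search(
    index='logs-all',
    query={
        "range": {
            "timestampRange": {
                "gte": "2022-04-05T01:00:00.708363",
                "lte": "2022-06-06T23:00:00.373772"
            }
        }
    }
)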
Like you wrote, you can simply use a wildcard and/or pass a list as the target index.
The simplest way would be to just query all of your indices with an asterisk wildcard (e.g. index-* or index-2022-*) as the target. You do not need to define an alias for that; you can just use the wildcard in the target string, like so:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('https://elastic.host:9200')

datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'

result = es_client.search(
    index='index-*',
    query={
        "bool": {
            "must": [{
                "range": {
                    "timestampRange": {
                        "gte": datestring_start,
                        "lte": datestring_end
                    }
                }
            }]
        }
    }
)
This will query all indices that match the pattern, but I would expect Elasticsearch to perform some sort of optimization on this. As #Val wrote in his answer, configuring index-time sorting will be beneficial for performance, as it limits the number of documents that should be visited when the index sort and the search sort are the same.
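For reference, index-time sorting has to be configured when an index is created; a sketch with the 8.x Python client might look like this (the index name is just an example, and it assumes timestampRange is a plain date field):
# Sketch only: index sorting cannot be added to an existing index.
es_client.indices.create(
    index='index-2022-07',
    settings={'index': {'sort.field': 'timestampRange', 'sort.order': 'asc'}},
    mappings={'properties': {'timestampRange': {'type': 'date'}}}
)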
For completeness' sake, if you really wanted to pass just the relevant index names to Elasticsearch, another option would be to first figure out on the Python side which sequence of indices you need to query and supply these as a list (e.g. ['index-2022-04', 'index-2022-05', 'index-2022-06']) as the target. You could e.g. use the Pandas date_range() function to easily generate such a list of indices, like so:
from elasticsearch import Elasticsearch
import pandas as pd

es_client = Elasticsearch('https://elastic.host:9200')

datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'

# Build ['index-2022-04', 'index-2022-05', 'index-2022-06'] from the date range
months_list = pd.date_range(
    pd.to_datetime(datestring_start).to_period('M').to_timestamp(),
    datestring_end,
    freq='MS'
).strftime("index-%Y-%m").tolist()

result = es_client.search(
    index=months_list,
    query={
        "bool": {
            "must": [{
                "range": {
                    "timestampRange": {
                        "gte": datestring_start,
                        "lte": datestring_end
                    }
                }
            }]
        }
    }
)

Google Sheets Python batchUpdate repeatCell -> issue with range and number format

I am trying to use the Google Sheets API for Python to format only a specific column's results to a "NUMBER" type, but am struggling to get it to work properly. Am I doing something wrong with the "range" block? There are values being appended to the column (via a different API call), and when they get appended they do not come back as formatted numbers that, when highlighting the entire column, result in a numbered sum.
id_sampleforstackoverflow = 'abcdefg123xidjadsfh192810'
cost_sav_body = {
    "requests": [
        {
            "repeatCell": {
                "range": {
                    "sheetId": 0,
                    "startRowIndex": 2,
                    "endRowIndex": 6,
                    "startColumnIndex": 0,
                    "endColumnIndex": 6
                },
                "cell": {
                    "userEnteredFormat": {
                        "numberFormat": {
                            "type": "NUMBER",
                            "pattern": "#.0#;#.0#"
                        }
                    }
                },
                "fields": "userEnteredFormat.numberFormat"
            }
        }
    ]
}

cost_sav_sum = service.spreadsheets().batchUpdate(
    spreadsheetId=id_sampleforstackoverflow, body=cost_sav_body).execute()
So when I run the above with the rest of my code, the values get appended; however, when highlighting the column, it simply gives me a count of the objects and not a formatted number summing the total of the values (i.e. there are three values of -24, but I only see a "Count" of 3 instead of -72).
I am using the GCP Recommendations API for machineType to append the cost_projection -> cost -> units value to the column (the values append, for example, as -24).
Can someone help?
Documentation I have already gone through:
https://cloud.google.com/blog/products/application-development/formatting-cells-with-the-google-sheets-api
https://developers.google.com/sheets/api/guides/formats
https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#GridRange
@all
I was able to figure out the problem. When doing straight reporting of the cost values (as explained above), I was converting the output to a string using the str() Python method. I removed that str() call and kept the rest of the code you see above, and now things are posting correctly:
#spend = str(element.primary_impact.cost_projection.cost.units)
spend = element.primary_impact.cost_projection.cost.units
So, FYI for anyone else wondering: make sure the str() method is not used if you need to apply a custom number format to those particular cells!
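For illustration, a minimal sketch of the append call that keeps the value numeric (the target range and sheet layout here are assumptions, not taken from the original code):
# Hypothetical sketch: append the value as a number so the column can be summed.
spend = element.primary_impact.cost_projection.cost.units  # no str() here

service.spreadsheets().values().append(
    spreadsheetId=id_sampleforstackoverflow,
    range='Sheet1!A:A',                   # example target column
    valueInputOption='USER_ENTERED',      # lets Sheets store the value as a number
    body={'values': [[spend]]}
).execute()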

Pulling a value from DataFrame based on another value

I'm playing around with the Facebook Ads API, I've pulled campaign data for one of my campaigns. If I have this dataframe:
[<Insights> {
    "actions": [
        {
            "action_type": "custom_event_abc",
            "value": 50
        },
        {
            "action_type": "custom_event_def",
            "value": 42
        },
    ]
}]
How would I go about getting the value for custom_event_def out?
In my wider results, I first used df.loc[0]['actions'][1]['value'] in my code, which worked, but my issue with that is that custom_event_abc doesn't always appear, so the position of custom_event_def can change; meaning my solution only works some of the time.
Can value (42) be pulled out using a reference to the action_type?
This will first create a variable actions with the content of "actions", iterate through it to find custom_event_def, and then print the corresponding value:
actions = df.loc[0]['actions']
for elem in actions:
    if elem['action_type'] == "custom_event_def":
        print(elem['value'])
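A minimal alternative sketch, assuming each action_type appears at most once per row, is to build a lookup dict first:
# Map action_type -> value for the first row, then look the event up by name.
actions = {a['action_type']: a['value'] for a in df.loc[0]['actions']}
value = actions.get('custom_event_def')   # None if the event is absent
print(value)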

Extract JSON | API | Pandas DataFrame

I am using the Facebook API (v2.10) to which I've extracted the data I need, 95% of which is perfect. My problem is the 'actions' metric which returns as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame, however, the 'actions' column is a list of dictionaries that contain each individual action for that day.
{
"actions": [
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "7"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "3"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "144"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "34"
}]}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get the action type, create a new column, and use "action_type" as the column name?
List the correct value under this column?
It looks like JSON, but when I check the type, it's a pandas Series (stored as an object).
For those willing to help (thank you, I greatly appreciate it): can you either point me in the direction of the right material so I can read it and work it out on my own (I'm not entirely sure what to look for), or, if you decide this is an easy problem, explain to me how and why you solved it this way? I don't just want the answer.
I have tried the following (with help from a friend) and it kind of works, but I have issues with it running in my script. I.e., if it runs within a bigger code block, I get the following error:
import ast

for i in range(df.shape[0]):
    line = df.loc[i, 'Conversions']
    L = ast.literal_eval(line)
    for l in L:
        cid = l['action_type']
        value = l['value']
        df.loc[i, cid] = value
If I save the DF as a CSV and read it back with pd.read_csv, it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian
You can use json_normalize:
In [11]: d # e.g. dict from json.load OR instead pass the json path to json_normalize
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
'value': '7'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.io.json.json_normalize(d, record_path="actions")
Out[12]:
action_type value
0 offsite_conversion.custom.xxxxxxxxxxx 7
1 offsite_conversion.custom.xxxxxxxxxxx 3
2 offsite_conversion.custom.xxxxxxxxxxx 144
3 offsite_conversion.custom.xxxxxxxxxxx 34
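If you need to flatten the whole column rather than a single dict, one possible sketch (assuming the column is called 'Conversions' and every cell holds such a list of action dicts) is:
import pandas as pd

# One action dict per row, remembering which original row it came from.
exploded = df['Conversions'].explode()
flat = pd.json_normalize(exploded.tolist())   # columns: action_type, value
flat.index = exploded.index                   # restore the original row labels

# One column per action_type, joined back onto the original frame.
wide = flat.pivot_table(index=flat.index, columns='action_type',
                        values='value', aggfunc='first')
result = df.join(wide)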
You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if your DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
df[col] = df2[col]

Pymongo / Mongodb Aggregate Tip

I have a collection of documents as follows:
{_id: [unique_id], kids: ['unique_id', 'unique_id', 'unique_id', 'unique_id']}
Each document has multiple fields, but I am only concerned with _id and kids, so I'm bringing those into focus (_id being the _id of the parent and kids an array of ids corresponding to the kids).
I have 2 million plus documents in the collection, and what I am looking for is the best (i.e. quickest) possible way to retrieve these records. What I tried initially for 100k documents goes as follows:
-- Plain aggregation:
t = coll.aggregate([
    {'$project': {'_id': 1, 'kids': 1}},
    {'$limit': 100000},
    {'$group': {'_id': '$_id', 'kids': {'$push': '$kids'}}}
])
This is taking around 85 seconds to aggregate for each _id.
-- Aggregation with a condition:
In some of the documents, the kids field is missing, so to get only the relevant documents I am adding a $match stage with $exists:
t = coll.aggregate([
    {'$project': {'_id': 1, 'kids': 1}},
    {'$match': {'kids': {'$exists': True}}},
    {'$limit': 100000},
    {'$group': {'_id': '$_id', 'kids': {'$push': '$kids'}}}
])
This is taking around 190 seconds for 100k records.
-- Recursion with find:
The third technique I am using is to find all the documents and append them to a dictionary, making sure that none of them repeats (since some kids are also parents). So, beginning with two parents and recursing:
def agg(qtc):
    qtc = [_id, _id]
    for a in qtc:
        for b in coll.find({'kids': {'$exists': True}, '_id': ObjectId(a)},
                           {'_id': 1, 'kids': 1}):
            t.append({'_id': str(a['_id']), 'kids': [str(c) for c in b['kids']]})
    t = [dict(_id=d['_id'], kid=v) for d in t for v in d['kids']]
    t = [dict(tupleized) for tupleized in set(tuple(item.items()) for item in t)]
    # The above two lines pair each id with each 'kid', so that if an id has
    # four kids, the resulting array contains four records for that id, each
    # with a single kid.
    for a in flatten(t):
        if a['kid'] in qtc:
            print 'skipped'
            continue
        else:
            t.extend(similar_once([k['kid'] for k in t]))
    return t
For this particular one, the time remains unknown, as I am not able to figure out how exactly to achieve this.
So, the objective is to get all the kids of all the parents (where some kids are also parents) in the minimum possible time. To mention again, I have tested for up to 100k records, and I have 2 million plus records. Any help would be great. Thanks.
Each document's _id is unique, so grouping by _id does nothing: unique documents go in, and the same unique documents come out. A simple find does all you need:
for document in coll.find(
        {'_id': {'$in': [parent_id1, parent_id2]}},
        {'_id': True, 'kids': True}):
    print document
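If you also need the kids of kids in one round trip, MongoDB 3.4+ offers $graphLookup to walk the parent/kid links server-side; a rough sketch (the collection name is an assumption) could be:
# Sketch only: fetch the parents and recursively collect their descendants.
pipeline = [
    {'$match': {'_id': {'$in': [parent_id1, parent_id2]}}},
    {'$graphLookup': {
        'from': 'my_collection',        # replace with the actual collection name
        'startWith': '$kids',
        'connectFromField': 'kids',
        'connectToField': '_id',
        'as': 'descendants'
    }},
    {'$project': {'_id': 1, 'kids': 1, 'descendants._id': 1, 'descendants.kids': 1}}
]
for doc in coll.aggregate(pipeline):
    print(doc)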
