I have a question about the expected behaviour of Elasticsearch (version 7.9.1) that I'm having a hard time finding the answer to.
I query Elasticsearch with the help of the elasticsearch-dsl (version 7.3.0) library. My code is as follows:
item_search = ItemSearch(search, query_facets)
item_search = item_search[0:9999]
res = item_search.execute()
Here search is a search term for full-text search, and query_facets is a dictionary mapping fields to the terms in the fields.
The ItemSearch class looks like this:
class ItemSearch(FacetedSearch):
    doc_types = [ItemsIndex, ]
    size = 20
    facets = {'language': TermsFacet(field='language.raw', size=size)}

    def __init__(self, search, query_facets):
        super().__init__(search, query_facets)

    def search(self):
        s = super(ItemSearch, self).search()
        return s
The language field has many thousands of distinct values, but I limited the facet size to 20 since we never want to display more than around that number anyway.
Now on to my actual question: I would expect that if I pass, for example, {'language': ["Dutch"]} to ItemSearch as the query_facets parameter, Elasticsearch would return the count for "Dutch" whether or not it belongs to the top 20 buckets. However, this is not the case. Is this the expected behaviour, or am I missing something? And if so, how can I achieve the result I'm after?
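To make the goal concrete, here is a rough sketch of the kind of thing I imagine might work, which I would call from the overridden search() method. add_selected_language_counts is just an illustrative name, and I'm assuming the terms aggregation's include parameter accepts an explicit list of values:
from elasticsearch_dsl import A

def add_selected_language_counts(s, selected_languages):
    # extra terms aggregation limited to the chosen values, independent of the top-20 facet
    if selected_languages:
        agg = A('terms', field='language.raw', include=list(selected_languages))
        s.aggs.bucket('selected_languages', agg)
    return s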
Related
I implemented a basic search program (almost all default options) based on sklearn's TfidfVectorizer to search some documents for a user query.
I also tried to implement the same thing using Whoosh in Python. While the standalone TfidfVectorizer implementation returns many intuitive results for the same query, the Whoosh query returns only one (and even that vanishes, dropping to 0 results, when I try to search in more fields). I want to know what I am doing wrong here.
I have tried setting the scoring in the Whoosh searcher according to the Whoosh docs:
with myindex.searcher(weighting=scoring.TF_IDF()) as s:
With this, I assumed it would give results somewhat similar to sklearn's TF-IDF vectorizer, but instead it returns only one hit. How do I get similar results, i.e., make use of something like sklearn's TF-IDF vectorizer implementation in Whoosh?
Also when I use multiple fields to search using
MultifieldParser(["title", "content", "tags", "categories"], ix.schema)
as opposed to only a single field "content", the result is no hits.
Schema:
schema = Schema(id=NUMERIC,
                title=TEXT(field_boost=2.0, stored=True, analyzer=StandardAnalyzer(minsize=1)),
                content=TEXT(stored=False, analyzer=StemmingAnalyzer(minsize=1)),
                permalink=ID(stored=True),
                tags=KEYWORD(field_boost=2.0, lowercase=True, commas=True, scorable=True, stored=True),
                categories=KEYWORD(field_boost=2.0, lowercase=True, commas=True, scorable=True, stored=True),
                pub_date=DATETIME(stored=True),
                creator=TEXT(stored=False))
Indexing and searching:
writer = ix.writer()
for i in range(len(df)):
    writer.add_document(id=df["ID"][i], title=df["Title"][i], content=df["Content"][i],
                        permalink=df["Permalink"][i], tags=df["Tag"][i], categories=df["Category"][i],
                        pub_date=df["PubDate"][i], creator=df["Creator"][i])
writer.commit()

with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
    parser = MultifieldParser(["title", "content", "tags", "categories"], ix.schema)
    query_string = sys.argv[2]
    myquery = parser.parse(query_string)
    results = searcher.search(myquery, limit=10, terms=True)
    print(len(results))
    for i in range(results.scored_length()):
        print(results[i])
    print()
    print("\n")
The code does work and fetches results. The only problem is that the results seem lacking compared to the TF-IDF implementation, and there are fewer of them in most cases (the issue is not the limit argument of Whoosh's search). I want to know how to get better results, or better scoring of results, in Whoosh, and why it returns fewer results than the plain TF-IDF implementation.
OUTPUT example for query "How to code?"
TF-IDF (sklearn):
30 Tips to Become Super Effective Software Developers
(Cosine Similarity of 0.3783876779183675 ):
Automation and Continuous Delivery are the bedrock of DevOps
(Cosine Similarity of 0.1476918570123896 ):
Practical Implementation of DevOps Step by Step
(Cosine Similarity of 0.1469115686911894 ):
10 Software Development Frustrations & What You Can Do To Avoid Them!
(Cosine Similarity of 0.13241987064219532 ):
WHOOSH (when searching only the content field; otherwise it returns 0 hits):
<Hit {'title': 'Ultimate List of 110 Must Read Software Development Books'}>
EDIT: I just ran the code again and found that if I remove "?" from the query "How to code?" and search only in "title" and "content", it returns quite a few results, and they seem better too. But as soon as I include "tags" and "categories" in the fields to search, the results drop to 0. Why is that?
? is treated as a wildcard. I'm playing with Whoosh right now and noticed that with:
query = QueryParser("content", ix.schema).parse("one")
I get:
<Top 1 Results for Term('content', 'one') runtime=0.0006002392619848251>
While if I search for one?:
query = QueryParser("content", ix.schema).parse("one?")
I get:
<Top 0 Results for Wildcard('content', 'one?') runtime=0.0002482738345861435>
As you can see in the second example, the returned object is Wildcard. Read more here: https://whoosh.readthedocs.io/en/latest/querylang.html#inexact-terms
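If you want ? and * to be treated as plain text rather than wildcards, one option is to remove the wildcard plugin from the query parser. This is only a sketch based on Whoosh's query-parser plugin mechanism; double-check it against the docs linked above:
from whoosh import qparser

parser = qparser.QueryParser("content", ix.schema)
parser.remove_plugin_class(qparser.WildcardPlugin)  # '?' and '*' lose their special meaning
print(parser.parse("one?"))  # no longer parsed as Wildcard('content', 'one?')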
I am using the elasticsearch-dsl Python library to connect to Elasticsearch and do aggregations.
I am using the following code:
search.aggs.bucket('per_date', 'terms', field='date')\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
This works fine but returns only 10 results in response.aggregations.per_date.buckets.
I want all the results.
I have tried one solution with size=0, as mentioned in this question:
search.aggs.bucket('per_date', 'terms', field='date', size=0)\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
But this results in an error:
TransportError(400, u'parsing_exception', u'[terms] failed to parse field [size]')
I had the same issue. I finally found this solution:
s = Search(using=client, index="jokes").query("match", jks_content=keywords).extra(size=0)
a = A('terms', field='jks_title.keyword', size=999999)
s.aggs.bucket('by_title', a)
response = s.execute()
After 2.x, size=0 for all bucket results no longer works; please refer to this thread. In my example I just set the size to 999999. You can pick a larger or smaller number according to your case.
It is recommended to explicitly set reasonable value for size a number
between 1 to 2147483647.
Hope this helps.
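For completeness, reading the buckets back out of that response looks roughly like this (the aggregation name by_title comes from the example above):
for bucket in response.aggregations.by_title.buckets:
    # each bucket exposes the term as .key and its document count as .doc_count
    print(bucket.key, bucket.doc_count)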
This is a bit older, but I ran into the same issue. What I wanted was basically an iterator that I could use to go through all the aggregations I got back (I also have a lot of unique results).
The best thing I found is to create a Python generator like this:
def scan_aggregation_results():
    i = 0
    partitions = 20
    while i < partitions:
        s = Search(using=elastic, index='my_index').extra(size=0)
        agg = A('terms', field='my_field.keyword', size=999999,
                include={"partition": i, "num_partitions": partitions})
        s.aggs.bucket('my_agg', agg)
        result = s.execute()

        for item in result.aggregations.my_agg.buckets:
            yield item.key

        i = i + 1

# in other parts of the code just do
for item in scan_aggregation_results():
    print(item)  # or do whatever you want with it
The magic here is that Elastic will automatically partition the results into 20 parts, i.e. the number of partitions I define. I just have to set the size to something large enough to hold a single partition; in this case the result can be up to roughly 20 million items (20 * 999999). If, like me, you have far fewer items to return (say 20,000), then you will only get about 1,000 results per query in your bucket, regardless of the much larger size you defined.
Using the generator construct as outlined above, you can then even get rid of that and create your own scanner, so to speak, iterating over all results individually, which is just what I wanted.
You should read the documentation.
So in your case, it should look like this:
search.aggs.bucket('per_date', 'terms', field='date')\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})[0:50]
response = search.execute()
I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide it into.
def Freedman_Diaconis(column_values):
    # column_values is a (name, values) pair; sort the values first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    n_to_pow = len(column_values[1]) ** (-1.0 / 3)  # -1.0/3 so Python 2 does not truncate the exponent to -1
    h = 2 * iqr * n_to_pow  # Freedman-Diaconis bin width
    # number of bins = data range / bin width
    retval = (column_values[1][-1] - column_values[1][0]) / h
    test = int(retval + 1)
    return test
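As a hypothetical example of how this helper is called (the column name and values here are made up):
ages = ('age', [23.0, 35.5, 41.2, 29.8, 52.1, 33.3, 47.6, 38.9, 61.0, 27.4])
num_bins = Freedman_Diaconis(ages)  # number of categories to discretize 'age' into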
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    # preprocessing is sklearn's module: from sklearn import preprocessing
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix + str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1]) + 'x' + str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    #for filename in os.listdir("."):
    #    if filename.e
    if not os.path.exists('baskets'):
        os.makedirs('baskets')
    down_length = len(list_of_lists[0])
    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)

print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules in my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is: why do I get multiple association rules that imply there are subgroups when, according to the notes, I shouldn't see any?
I'm getting lift numbers above 2, as opposed to the 1 you would expect if everything were random, as the notes state.
Supp Conf Rule
0.3 0.7 6x0 -> trt1
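For reference, this is how I'm thinking about lift in terms of the numbers above (the 0.35 support for trt1 is just a made-up value for illustration):
def lift(confidence_a_to_b, support_b):
    # lift > 1 means A and B co-occur more often than expected if they were independent
    return confidence_a_to_b / support_b

# e.g. if trt1 appeared in 35% of baskets, the rule above would have lift 0.7 / 0.35 = 2.0
print lift(0.7, 0.35)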
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables that I have. I would need a way larger sample size in order to really use the method that I was using. In fact, the method that I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.
Looking for a simple example of retrieving 500 items from dynamodb minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are two ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key
Use the query method with the hash_key.
You may ask to sort the range_keys A-Z or Z-A.
2. Items are on "random" keys
You said it: use the BatchGetItem method.
Good news: the limit is actually 100 items per request, or 1 MB max.
You will have to sort the results on the Python side.
On the practical side, since you use Python, I highly recommend the Boto library for low-level access, or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. On the contrary, there is a generator for scan and for query which 'pretends' you get everything in a single query.
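For example, with Boto's layer-2 interface, something along these lines hides the paging for a scan (this is only meant to illustrate that 'pretend' behaviour; adjust to your own table):
import boto

db = boto.connect_dynamodb()
table = db.get_table('MyTable')
for item in table.scan():  # Boto transparently issues the follow-up requests
    print item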
In order to get optimal results with the batch query, I recommend this workflow:
submit a batch with all of your 500 items.
store the results in your dicts
re-submit with the UnprocessedKeys as many times as needed
sort the results on the Python side (a small sketch of this step follows)
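For that last sorting step, a minimal sketch (the attribute used as the sort key is just a placeholder):
# items: dict mapping hash_key -> item, filled while walking the batch responses
ordered = sorted(items.values(), key=lambda item: item['my_sort_attribute'])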
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]

    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None

    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)

        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))

        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']

        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)

    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()

keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)

res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)

# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
add all of your requested keys to the BatchList
submit()
resubmit() as long as it does not return None
This should be available in the next release.
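As a rough sketch of that simplified workflow (using the develop-branch method described above, so the exact call may differ in the released version):
batch = db.new_batch_list()
batch.add_batch(table, keys)

res = batch.submit()
while res is not None:
    print res  # consume this page of results
    res = batch.resubmit()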
I just watched the Batch data processing with App Engine session of Google I/O 2010, read some parts of the MapReduce article from Google Research, and now I am thinking of using MapReduce on Google App Engine to implement a recommender system in Python.
I prefer using appengine-mapreduce instead of Task Queue API because the former offers easy iteration over all instances of some kind, automatic batching, automatic task chaining, etc. The problem is: my recommender system needs to calculate correlation between instances of two different Models, i.e., instances of two distinct kinds.
Example:
I have these two Models: User and Item. Each one has a list of tags as an attribute. Below are the functions to calculate correlation between users and items. Note that calculateCorrelation should be called for every combination of users and items:
def calculateCorrelation(user, item):
    return calculateCorrelationAverage(user.tags, item.tags)

def calculateCorrelationAverage(tags1, tags2):
    correlationSum = 0.0
    for (tag1, tag2) in allCombinations(tags1, tags2):
        correlationSum += correlation(tag1, tag2)
    return correlationSum / (len(tags1) + len(tags2))

def allCombinations(list1, list2):
    combinations = []
    for x in list1:
        for y in list2:
            combinations.append((x, y))
    return combinations
But calculateCorrelation is not a valid mapper in appengine-mapreduce, and maybe this function is not even compatible with the MapReduce computation concept. Still, I need to be sure... it would be really great to have those appengine-mapreduce advantages like automatic batching and task chaining.
Is there any solution for that?
Should I define my own InputReader? Is a new InputReader that reads all instances of two different kinds compatible with the current appengine-mapreduce implementation?
Or should I try the following?
Combine all keys of all entities of these two kinds, two by two, into instances of a new Model (possibly using MapReduce)
Iterate using mappers over instances of this new Model
For each instance, use the keys inside it to get the two entities of different kinds and calculate the correlation between them (a rough sketch of this idea follows).
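To make that option concrete, here is a rough sketch of what such a pair model and its mapper might look like (CorrelationPair and its property names are purely illustrative; User and Item are the kinds from the example above):
from google.appengine.ext import db

class CorrelationPair(db.Model):
    # one entity per (User, Item) combination, created by the key-combining step
    user = db.ReferenceProperty(User, collection_name='pairs_as_user')
    item = db.ReferenceProperty(Item, collection_name='pairs_as_item')

def correlation_pair_mapper(pair):
    # mapper over CorrelationPair: accessing the properties dereferences both entities
    score = calculateCorrelation(pair.user, pair.item)
    # ...store or emit the score here...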
Following Nick Johnson's suggestion, I wrote my own InputReader. This reader fetches entities of two different kinds and yields tuples with all combinations of these entities. Here it is:
class TwoKindsInputReader(InputReader):
    _APP_PARAM = "_app"
    _KIND1_PARAM = "kind1"
    _KIND2_PARAM = "kind2"
    MAPPER_PARAMS = "mapper_params"

    def __init__(self, reader1, reader2):
        self._reader1 = reader1
        self._reader2 = reader2

    def __iter__(self):
        for u in self._reader1:
            for e in self._reader2:
                yield (u, e)

    @classmethod
    def from_json(cls, input_shard_state):
        reader1 = DatastoreInputReader.from_json(input_shard_state[cls._KIND1_PARAM])
        reader2 = DatastoreInputReader.from_json(input_shard_state[cls._KIND2_PARAM])
        return cls(reader1, reader2)

    def to_json(self):
        json_dict = {}
        json_dict[self._KIND1_PARAM] = self._reader1.to_json()
        json_dict[self._KIND2_PARAM] = self._reader2.to_json()
        return json_dict

    @classmethod
    def split_input(cls, mapper_spec):
        params = mapper_spec.params
        app = params.get(cls._APP_PARAM)
        kind1 = params.get(cls._KIND1_PARAM)
        kind2 = params.get(cls._KIND2_PARAM)
        shard_count = mapper_spec.shard_count
        shard_count_sqrt = int(math.sqrt(shard_count))
        splitted1 = DatastoreInputReader._split_input_from_params(app, kind1, params, shard_count_sqrt)
        splitted2 = DatastoreInputReader._split_input_from_params(app, kind2, params, shard_count_sqrt)

        inputs = []
        for u in splitted1:
            for e in splitted2:
                inputs.append(TwoKindsInputReader(u, e))

        #mapper_spec.shard_count = len(inputs) #uncomment this in case of "Incorrect number of shard states" (at line 408 in handlers.py)
        return inputs

    @classmethod
    def validate(cls, mapper_spec):
        return True  #TODO
This code should be used when you need to process all combinations of entities of two kinds. You can also generalize this for more than two kinds.
Here is a valid mapreduce.yaml for TwoKindsInputReader:
mapreduce:
- name: recommendationMapReduce
  mapper:
    input_reader: customInputReaders.TwoKindsInputReader
    handler: recommendation.calculateCorrelationHandler
    params:
    - name: kind1
      default: kinds.User
    - name: kind2
      default: kinds.Item
    - name: shard_count
      default: 16
It's difficult to know what to recommend without more details of what you're actually calculating. One simple option is to simply fetch the related entity inside the map call - there's nothing preventing you from doing datastore operations there.
This will result in a lot of small calls, though. Writing a custom InputReader, as you suggest, will allow you to fetch both sets of entities in parallel, which will significantly improve performance.
If you give more details as to how you need to join these entities, we may be able to provide more concrete suggestions.
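For illustration, the first option (fetching the other kind inside the map call) could look roughly like this, reusing the Item model and calculateCorrelation function from the question:
def user_mapper(user):
    # one datastore iteration per mapped User entity: simple, but it issues many small calls
    for item in Item.all():
        calculateCorrelation(user, item)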