Dynamic queries in Elasticsearch and the issue of keyword arguments - Python

I'm currently running into a problem trying to build dynamic queries for Elasticsearch in Python. To build a query I use the Q shortcut from elasticsearch_dsl. This is what I am trying to do:
...
s = Search(using=db, index="reestr")
condition = {"attr_1_":"value 1", "attr_2_":"value 2"} # try to build query from this
must = []
for key in condition:
    must.append(Q('match', key=condition[key]))
But that in fact results in this condition:
[Q('match',key="value 1"),Q('match',key="value 2")]
However, what I want is:
[Q('match',attr_1_="value 1"),Q('match',attr_2_="value 2")]
IMHO, the way this library builds queries is not effective. I think this syntax:
Q("match", "attribute_name"="attribute_value")
would be much more powerful and make a lot more possible than this one:
Q("match", attribute_name="attribute_value")
It seems as if it is impossible to dynamically build attribute names. Or, of course, it may be that I simply do not know the right way to do it.
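For reference, keyword names can be built dynamically with dict unpacking, which is exactly what the answer below relies on. A minimal sketch using the condition dict above:

must = []
for key, value in condition.items():
    # **{key: value} expands to attr_1_="value 1", attr_2_="value 2", ...
    must.append(Q('match', **{key: value}))
# must == [Q('match', attr_1_="value 1"), Q('match', attr_2_="value 2")]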

Suppose,
filters = {'condition1':['value1'],'condition2':['value3','value4']}
Code:
from elasticsearch_dsl.query import Bool  # Bool is used below

filters = data['filters_data']
must_and = list()   # conditions that have only one value
should_or = list()  # conditions that have more than one value
for key in filters:
    if len(filters[key]) > 1:
        for item in filters[key]:
            should_or.append(Q("match", **{key: item}))
    else:
        must_and.append(Q("match", **{key: filters[key][0]}))
q1 = Bool(must=must_and)
q2 = Bool(should=should_or)
s = s.query(q1).query(q2)
result = s.execute()
One can also use terms, which directly accepts a list of values, so there is no need for the complicated for loops:
Code:
for key in filters:
    must_and.append(Q("terms", **{key: filters[key]}))
q1 = Bool(must=must_and)
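A short usage sketch, reusing the s and q1 built above, to run the query and walk the hits:

# run the terms-based query and iterate over the matching documents
response = s.query(q1).execute()
for hit in response:
    print(hit.meta.id, hit.to_dict())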

Related

Looking for a good way to write an if structure with multiple possibilities

I'm using if statements to retrieve something from the db. It has multiple conditions; my current implementation works, but I'm looking for better options.
data = get(col1=a, col2=b, col3=c)
if not data:
    # fall to 2nd option
    data = get(col1=a, col2=b)
if not data:
    # fall to 3rd option
    data = get(col1=a, col3=c)
if not data:
    # fallback to 4th option
    data = get(col1=a)
The get function retrieves rows from a MySQL database. Is there a better way in Python to do this?
This is a bit cleaner, and you can easily add new cases to it:
args = [dict(col1=a, col2=b, col3=c),
        dict(col1=a, col2=b),
        dict(col1=a, col3=c),
        dict(col1=a)]
for arg_dict in args:
    data = get(**arg_dict)
    if data:
        break
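If this pattern recurs, it can also be wrapped in a small helper. A minimal sketch, assuming the same hypothetical get() accessor from the question:

def get_with_fallbacks(a, b, c):
    # Try progressively less specific queries until one returns data.
    attempts = [
        dict(col1=a, col2=b, col3=c),
        dict(col1=a, col2=b),
        dict(col1=a, col3=c),
        dict(col1=a),
    ]
    for kwargs in attempts:
        data = get(**kwargs)  # hypothetical DB accessor from the question
        if data:
            return data
    return None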

Apriori Results in Python

I am trying to run an apriori algorithm in Python. My specific problem is that when I use the apriori function I specify min_length as 2, yet when I print the rules, I get rules that contain only 1 item. I am wondering why apriori does not filter out itemsets with fewer than 2 items, since I specified that I only want rules with 2 things in the itemset.
from apyori import apriori

# store the transactions
transactions = []
total_transactions = 0
with open('browsing.txt', 'r') as file:
    for transaction in file:
        total_transactions += 1
        items = []
        for item in transaction.split():
            items.append(item)
        transactions.append(items)

support_threshold = (100 / total_transactions)
print(support_threshold)
minimum_support = 100
frequent_items = apriori(transactions, min_length=2, min_support=support_threshold)
association_results = list(frequent_items)
print(association_results[0])
print(association_results[1])
My results:
RelationRecord(items=frozenset({'DAI11223'}), support=0.004983762579981351, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'DAI11223'}), confidence=0.004983762579981351, lift=1.0)])
RelationRecord(items=frozenset({'DAI11778'}), support=0.0037619369152117293, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'DAI11778'}), confidence=0.0037619369152117293, lift=1.0)])
A look into the code here: https://github.com/ymoch/apyori/blob/master/apyori.py reveals that there is no min_length keyword (only max_length). The way apyori is implemented, it does not raise any warning or error when you pass keyword arguments that are not used.
Why not filter the result afterwards? In Python 3, filter returns a lazy iterator, so wrap it in list() if you need to index the results:
association_results = list(filter(lambda x: len(x.items) > 1, association_results))
A limitation of the first approach is that the data has to be converted into a list format; in real life a store has many thousands of SKUs, and in that case this becomes computationally expensive.
The apyori package is also outdated; there has been no update for the past few years.
Its results come in an awkward format that needs extra processing to present properly.
mlxtend uses a two-step approach: it generates the frequent itemsets and then builds association rules on top of them - check here for more info.
mlxtend is well maintained and has community support.
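A minimal sketch of the mlxtend route, assuming the transactions list and support_threshold built in the question (the function and column names are mlxtend's standard API):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# one-hot encode the list-of-lists transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# frequent itemsets above the support threshold, then rules derived from them
frequent_itemsets = apriori(df, min_support=support_threshold, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

# keep only itemsets with at least 2 items, mirroring the filter above
frequent_itemsets = frequent_itemsets[frequent_itemsets["itemsets"].apply(len) >= 2]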

elasticsearch-dsl aggregations returns only 10 results. How to change this

I am using the elasticsearch-dsl Python library to connect to Elasticsearch and do aggregations. I am using the following code:
search.aggs.bucket('per_date', 'terms', field='date')\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
This works fine but returns only 10 results in response.aggregations.per_ts.buckets. I want all the results.
I have tried one solution with size=0, as mentioned in this question:
search.aggs.bucket('per_ts', 'terms', field='ts', size=0)\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
But this results in an error:
TransportError(400, u'parsing_exception', u'[terms] failed to parse field [size]')
I had the same issue. I finally found this solution:
s = Search(using=client, index="jokes").query("match", jks_content=keywords).extra(size=0)
a = A('terms', field='jks_title.keyword', size=999999)
s.aggs.bucket('by_title', a)
response = s.execute()
After 2.x, size=0 for all bucket results won't work anymore; please refer to this thread. In my example here I just set the size to 999999. You can pick a large number according to your case.
It is recommended to explicitly set a reasonable value for size, a number between 1 and 2147483647.
Hope this helps.
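For completeness, a short sketch of reading the buckets back out of the response above (using the by_title bucket name from the example):

for bucket in response.aggregations.by_title.buckets:
    # each bucket exposes the term as .key and its count as .doc_count
    print(bucket.key, bucket.doc_count)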
This is a bit older, but I ran into the same issue. What I wanted was basically an iterator that I could use to go through all the aggregations I got back (I also have a lot of unique results).
The best thing I found is to create a Python generator like this:
def scan_aggregation_results():
    i = 0
    partitions = 20
    while i < partitions:
        s = Search(using=elastic, index='my_index').extra(size=0)
        agg = A('terms', field='my_field.keyword', size=999999,
                include={"partition": i, "num_partitions": partitions})
        s.aggs.bucket('my_agg', agg)
        result = s.execute()
        for item in result.aggregations.my_agg.buckets:
            yield item.key
        i = i + 1

# in other parts of the code just do
for item in scan_aggregation_results():
    print(item)  # or do whatever you want with it
The magic here is that Elasticsearch will automatically partition the results into 20 partitions, i.e. the number of partitions I define. I just have to set the size to something large enough to hold a single partition; in this case the result can be up to 20 million items large (20 * 999999). If, like me, you have far fewer items to return (say 20000), then you will just have 1000 results per query in your bucket, regardless of the much larger size you defined.
Using the generator construct as outlined above, you can even create your own scanner, so to speak, iterating over all the results individually, which is just what I wanted.
You should read the documentation. So in your case it should be like this:
search.aggs.bucket('per_date', 'terms', field='date')\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})[0:50]
response = search.execute()

Validating that all components required for an object to exist are present

I need to write a script that gets a list of components from an external source and based on a pre-defined list it validates whether the service is complete. This is needed because the presence of a single component doesn't automatically imply that the service is present - some components are pre-installed even when there is no service. I've devised something really simple below, but I was wondering what is the intelligent way of doing this? There must be a cleaner, simpler way.
# Components that make up a complete service
serviceComponents = ['A', 'B']
# Input from JSON
data = ['B', 'A', 'C']

serviceComplete = True
for i in serviceComponents:
    if i in data:
        print('yay ' + i + ' found from ' + ', '.join(serviceComponents))
    else:
        serviceComplete = False
        break
# If serviceComplete == True, do blabla...
You could do it a few different ways:
set(serviceComponents) <= set(data)
set(serviceComponents).issubset(data)
all(c in data for c in serviceComponents)
You can make it shorter, but you lose readability. What you have now is probably fine. I'd go with the first approach personally, since it expresses your intent clearly with set operations.
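For example, with the lists from the question, the first form reads naturally:

serviceComponents = ['A', 'B']
data = ['B', 'A', 'C']

# True: every required component appears in the input
serviceComplete = set(serviceComponents) <= set(data)
print(serviceComplete)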
# Components that make up a complete service
serviceComponents = ['A', 'B']
# Input from JSON
data = ['B', 'A', 'C']

if all(item in data for item in serviceComponents):
    print("All required components are present")
The built-in set would serve you well; use set.issubset to verify that your required service components are a subset of the input data:
serviceComponents = set(['A', 'B'])
input_data = set(['B', 'A', 'C'])

if serviceComponents.issubset(input_data):
    pass  # perform actions ...

Simple example of retrieving 500 items from dynamodb using Python

Looking for a simple example of retrieving 500 items from dynamodb minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are 2 ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key:
use the query method with the hash_key;
you may ask to sort the range_keys A-Z or Z-A.
2. Items are on "random" keys:
you said it: use the BatchGetItem method;
good news: the limit is actually 100 per request or 1 MB max;
you will have to sort the results on the Python side.
On the practical side, since you use Python, I highly recommend the Boto library for low-level access or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. By contrast, there is a generator for scan and for query which 'pretends' you get everything in a single query.
In order to get optimal results with the batch query, I recommend this workflow:
submit a batch with all of your 500 items;
store the results in your dicts;
re-submit with the UnprocessedKeys as many times as needed;
sort the results on the Python side.
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]
    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None
    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)
        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))
        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']
        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)
    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()
keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)
res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)
# The END
# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
add all of your requested keys to BatchList
submit()
resubmit() as long as it does not return None.
This should be available in the next release.
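A minimal sketch of that simplified workflow, assuming the resubmit() method described above is exposed on BatchList (the no-argument call is an assumption based on this description):

import boto

db = boto.connect_dynamodb()
table = db.get_table('MyTable')

keys = range(500)  # your 500 hash keys
batch = db.new_batch_list()
batch.add_batch(table, keys)
res = batch.submit()
while res:
    # store / merge the returned items here
    res = batch.resubmit()  # assumed helper; returns None once nothing is left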
