Odd behavior with begins_with and a binary column in DynamoDB - python

Summary
When querying a binary range key using begins_with, some results are not returned even though they begin with the value being queried. This appears to happen only with certain values, and only in DynamoDB Local, not the AWS-hosted version of DynamoDB.
Here is a gist you can run that reproduces the issue: https://gist.github.com/pbaughman/922db7b51f7f82bbd9634949d71f846b
Details
I have a DynamoDB table with the following schema:
user_id - Primary Key - binary - contains a 16-byte UUID
project_id_item_id - Sort Key - binary - 32 bytes - two UUIDs concatenated
While running my unit tests locally against the dynamodb-local Docker image, I have observed some bizarre behavior.
I've inserted 20 items into my table like this:
table.put_item(
    Item={
        'user_id': user_id.bytes,
        'project_id_item_id': project_id.bytes + item_id.bytes
    }
)
Each item has the same user_id and the same project_id with a different item_id.
When I attempt to query the same data back out, sometimes (maybe 1 in 5 times that I run the test) I only get some of the items back out:
table.query(
    KeyConditionExpression=
        Key('user_id').eq(user_id.bytes) &
        Key('project_id_item_id').begins_with(project_id.bytes)
)
# Only returns 14 items
If I drop the 2nd condition from the KeyConditionExpression, I get all 20 items.
If I run a scan instead of a query and use the same condition expression, I also get all 20 items:
table.scan(
    FilterExpression=
        Key('user_id').eq(user_id.bytes) &
        Key('project_id_item_id').begins_with(project_id.bytes)
)
# 20 items are returned
If I print the project_id_item_id of every item in the table, I can see that they all start with the same project_id:
[i['project_id_item_id'].value.hex() for i in table.scan()['Items']]
# Result:
|---------Project Id-----------|
['76761923aeba4edf9fccb9eeb5f80cc40604481b26c84c73b63308dd588a4df1',
'76761923aeba4edf9fccb9eeb5f80cc40ec926452c294c909befa772b86e2175',
'76761923aeba4edf9fccb9eeb5f80cc460ff943b36ec44518175525d6eb30480',
'76761923aeba4edf9fccb9eeb5f80cc464e427afe84d49a5b3f890f9d25ee73b',
'76761923aeba4edf9fccb9eeb5f80cc466f3bfd77b14479a8977d91af1a5fa01',
'76761923aeba4edf9fccb9eeb5f80cc46cd5b7dec9514714918449f8b49cbe4e',
'76761923aeba4edf9fccb9eeb5f80cc47d89f44aae584c1c9da475392cb0a085',
'76761923aeba4edf9fccb9eeb5f80cc495f85af4d1f142608fae72e23f54cbfb',
'76761923aeba4edf9fccb9eeb5f80cc496374432375a498b937dec3177d95c1a',
'76761923aeba4edf9fccb9eeb5f80cc49eba93584f964d13b09fdd7866a5e382',
'76761923aeba4edf9fccb9eeb5f80cc4a6086f1362224115b7376bc5a5ce66b8',
'76761923aeba4edf9fccb9eeb5f80cc4b5c6872aa1a84994b6f694666288b446',
'76761923aeba4edf9fccb9eeb5f80cc4be07cd547d804be4973041cfd1529734',
'76761923aeba4edf9fccb9eeb5f80cc4c48daab011c449f993f061da3746a660',
'76761923aeba4edf9fccb9eeb5f80cc4d09bc44973654f39b95a91eb3e291c68',
'76761923aeba4edf9fccb9eeb5f80cc4d0edda3d8c6643ad8e93afe2f1b518d4',
'76761923aeba4edf9fccb9eeb5f80cc4d8d1f6f4a85e47d78e2d06ec1938ee2a',
'76761923aeba4edf9fccb9eeb5f80cc4dc7323adfa35423fba15f77facb9a41b',
'76761923aeba4edf9fccb9eeb5f80cc4f948fb40873b425aa644f220cdcb5d4b',
'76761923aeba4edf9fccb9eeb5f80cc4fc7f0583f593454d92a8a266a93c6fcd']
As a sanity check, here is the project_id I'm using in my query:
print(project_id)
76761923-aeba-4edf-9fcc-b9eeb5f80cc4 # Matches what's returned by scan above
Finally, the most bizarre part: if I try to match fewer bytes of the project ID, I start to see all 20 items, then zero items, then all 20 items again:
hash_key = Key('user_id').eq(user_id.bytes)
for n in range(1, 17):
    short_key = project_id.bytes[:n]
    range_key = Key('project_id_item_id').begins_with(short_key)
    count = table.query(KeyConditionExpression=hash_key & range_key)['Count']
    print("If I only query for 0x{:32} I find {} items".format(short_key.hex(), count))
Gets me:
If I only query for 0x76 I find 20 items
If I only query for 0x7676 I find 20 items
If I only query for 0x767619 I find 20 items
If I only query for 0x76761923 I find 20 items
If I only query for 0x76761923ae I find 20 items
If I only query for 0x76761923aeba I find 20 items
If I only query for 0x76761923aeba4e I find 20 items
If I only query for 0x76761923aeba4edf I find 0 items
If I only query for 0x76761923aeba4edf9f I find 20 items
If I only query for 0x76761923aeba4edf9fcc I find 0 items
If I only query for 0x76761923aeba4edf9fccb9 I find 20 items
If I only query for 0x76761923aeba4edf9fccb9ee I find 0 items
If I only query for 0x76761923aeba4edf9fccb9eeb5 I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f8 I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f80c I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f80cc4 I find 15 items
I am totally dumbfounded by this pattern. If the range key I'm searching for is 8, 10 or 12 bytes long I get no matches. If it's 16 bytes long I get fewer than 20 but more than 0 matches.
Does anybody have any idea what could be going on here? The documentation indicates that the begins_with expression works with Binary data. I'm totally at a loss as to what could be going wrong. I wonder if DynamoDB-local is doing something like converting the binary data to strings internally to do the comparisons and some of these binary patterns don't convert correctly.
It seems like it might be related to the project_id UUID. If I hard-code it to 76761923-aeba-4edf-9fcc-b9eeb5f80cc4 in the test, I can make it miss items every time.

This may be a six-year-old bug in DynamoDB Local. I will leave this question open in case someone has more insight, and I will update this answer if I'm able to find out more information from Amazon.
Edit: As of June 23rd, they have managed to reproduce the issue and it is in the queue to be fixed in a future release.
2nd Edit: As of August 4th, they are investigating the issue and a fix will be released shortly.
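In the meantime, a workaround that avoids the broken begins_with evaluation in DynamoDB Local is to query on the hash key alone (which, as shown above, returns all 20 items) and apply the prefix check client-side. This is only a sketch, assuming the same table, user_id, and project_id objects as in the question:
from boto3.dynamodb.conditions import Key

# Query by hash key only, then filter the Binary sort key's raw bytes in Python.
# This trades a little extra read throughput for correct results under DynamoDB Local.
response = table.query(KeyConditionExpression=Key('user_id').eq(user_id.bytes))
items = [
    item for item in response['Items']
    if item['project_id_item_id'].value.startswith(project_id.bytes)
]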

Related

How to overwrite older existing ID's when merging into new table?

I am currently caching data from an API by storing all data to a temporary table and merging into a non-temp table where ID/UPDATED_AT is unique.
ID/UPDATED_AT example:
MERGE INTO vet_data_patients_stg
USING vet_data_patients_temp_stg
    ON vet_data_patients_stg.updated_at = vet_data_patients_temp_stg.updated_at
    AND vet_data_patients_stg.id = vet_data_patients_temp_stg.id
WHEN NOT MATCHED THEN
    INSERT
    (
        id,
        updated_at,
        <<<my_other_fields>>>
    )
    VALUES
    (
        vet_data_patients_temp_stg.id,
        vet_data_patients_temp_stg.updated_at,
        <<<my_other_fields>>>
    )
My issue is that this method leaves the older ID/UPDATED_AT rows in the table as well, but I only want the row with the most recent UPDATED_AT for each ID, so that IDs are unique in the table.
Can I accomplish this by modifying my merge statement?
My Python way of auto-generating the string is:
merge_string = (
    f'MERGE INTO {str.upper(tablex)}_{str.upper(envx)} '
    f'USING {str.upper(tablex)}_TEMP_{str.upper(envx)} ON '
    + ' AND '.join(f'{str.upper(tablex)}_{str.upper(envx)}.{x}='
                   f'{str.upper(tablex)}_TEMP_{str.upper(envx)}.{x}' for x in keysx)
    + f' WHEN NOT MATCHED THEN INSERT ({field_columnsx}) VALUES ('
    + ','.join(f'{str.upper(tablex)}_TEMP_{str.upper(envx)}.{x}' for x in fieldsx)
    + ')'
)
EDIT - Examples to more clearly illustrate the goal:
So if my TABLE_STG has:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-01-01|A
2|2020-02-01|B
And my API gets the following in TABLE_TEMP_STG:
ID|UPDATED_AT|FIELD
1|2020-02-01|A
2|2020-02-01|B
I currently end up with:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-01-01|A
1|2020-02-01|A
2|2020-02-01|B
But I really want to remove the older updated_at's and end up with:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-02-01|A
2|2020-02-01|B
We can do deletes in the MATCHED branch of a MERGE statement. Your code needs to look like this:
MERGE INTO vet_data_patients_stg
USING vet_data_patients_temp_stg
    ON vet_data_patients_stg.updated_at = vet_data_patients_temp_stg.updated_at
    AND vet_data_patients_stg.id = vet_data_patients_temp_stg.id
WHEN NOT MATCHED THEN
    INSERT
    (
        id,
        updated_at,
        <<<my_other_fields>>>
    )
    VALUES
    (
        vet_data_patients_temp_stg.id,
        vet_data_patients_temp_stg.updated_at,
        <<<my_other_fields>>>
    )
WHEN MATCHED THEN
    UPDATE
    SET some_other_field = vet_data_patients_temp_stg.some_other_field
    DELETE WHERE 1 = 1
This deletes every row that is updated, which here means every matched row.
Note that you need to include the UPDATE clause even though you want to delete all the matched rows: the DELETE logic is applied only to records which are updated, and the syntax doesn't allow us to leave the UPDATE out.
There is a proof of concept on db<>fiddle.
Re-writing the python code to generate this statement is left as an exercise for the reader :)
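As a rough starting point, here is a hedged sketch of what that generator might look like (tablex, envx, keysx, fieldsx, and field_columnsx are assumed to be the same variables used in the question, and some_other_field is the placeholder column from the statement above):
def build_merge(tablex, envx, keysx, fieldsx, field_columnsx):
    # Sketch: build the MERGE above, including the WHEN MATCHED ... DELETE branch.
    target = f'{tablex.upper()}_{envx.upper()}'
    source = f'{tablex.upper()}_TEMP_{envx.upper()}'
    on_clause = ' AND '.join(f'{target}.{k}={source}.{k}' for k in keysx)
    insert_values = ','.join(f'{source}.{f}' for f in fieldsx)
    return (
        f'MERGE INTO {target} USING {source} ON {on_clause} '
        f'WHEN NOT MATCHED THEN INSERT ({field_columnsx}) VALUES ({insert_values}) '
        # The mandatory UPDATE must touch a column that is not part of the ON clause.
        f'WHEN MATCHED THEN UPDATE SET {target}.some_other_field = {source}.some_other_field '
        f'DELETE WHERE 1 = 1'
    )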
The asker hasn't posted a representative test case with sample input data and the desired outcome derived from those samples, so it may be that this doesn't do what they are expecting.

Problem when deleting records from mongodb using pymongo

So I have some 50 document IDs. My Python list veno contains document IDs as shown below:
5ddfc565bd293f3dbf502789
5ddfc558bd293f3dbf50263b
5ddfc558bd293f3dbf50264f
5ddfc558bd293f3dbf50264d
5ddfc565bd293f3dbf502792
But when I try to delete those 50 document IDs, I am having a hard time. Let me explain: I need to run my Python script over and over again in order to delete all 50 documents. The first time I run my script it will delete some 10, the next time I run it, it deletes 18, and so on. My for loop is pretty simple, as shown below:
for i in veno:
    vv = i[0]
    db.Products2.delete_many({'_id': ObjectId(vv)})
If your list is just the ids, then you want:
for i in veno:
    db.Products2.delete_many({'_id': ObjectId(i)})
full example:
from pymongo import MongoClient
from bson import ObjectId
db = MongoClient()['testdatabase']
# Test data setup
veno = [str(db.testcollection.insert_one({'a': 1}).inserted_id) for _ in range(50)]
# Quick peek to see we have the data correct
for x in range(3): print(veno[x])
print(f'Document count before delete: {db.testcollection.count_documents({})}')
for i in veno:
    db.testcollection.delete_many({'_id': ObjectId(i)})
print(f'Document count after delete: {db.testcollection.count_documents({})}')
gives:
5ddffc5ac9a13622dbf3d88e
5ddffc5ac9a13622dbf3d88f
5ddffc5ac9a13622dbf3d890
Document count before delete: 50
Document count after delete: 0
I don't have any Mongo instance to test with, but what about:
veno = [
    '5ddfc565bd293f3dbf502789',
    '5ddfc558bd293f3dbf50263b',
    '5ddfc558bd293f3dbf50264f',
    '5ddfc558bd293f3dbf50264d',
    '5ddfc565bd293f3dbf502792',
]
# Or for your case (whatever you have in **veno**)
veno = [vv[0] for vv in veno]
####
db.Products2.delete_many({'_id': {'$in': [ObjectId(vv) for vv in veno]}})
If this doesn't work, then maybe this:
db.Products2.remove({'_id': {'$in':[ObjectId(vv) for vv in veno]}})
From what I understand, delete_many's first argument is a filter, so it's designed in such a way that you don't delete particular documents but rather all documents that satisfy a particular condition.
In the above case, the best approach is to delete all documents at once: delete every document whose _id is in ($in) the list [ObjectId(vv) for vv in veno].
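As a small self-contained sketch of that approach (the database name and the sample ids below are assumptions based on the question; delete_many returns a DeleteResult whose deleted_count you can check):
from pymongo import MongoClient
from bson import ObjectId

db = MongoClient()['mydatabase']  # assumed connection and database name

# Sample ids copied from the question; in practice this is your full veno list.
veno = ['5ddfc565bd293f3dbf502789', '5ddfc558bd293f3dbf50263b']

# If each entry of veno is itself a list/tuple (as the original loop's i[0]
# suggests), take the first element; otherwise use the string directly.
ids = [vv[0] if isinstance(vv, (list, tuple)) else vv for vv in veno]

result = db.Products2.delete_many({'_id': {'$in': [ObjectId(i) for i in ids]}})
print(result.deleted_count)  # number of documents removed in one round trip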

elasticsearch-dsl aggregations returns only 10 results. How to change this

I am using the elasticsearch-dsl Python library to connect to Elasticsearch and do aggregations.
I am using the following code:
search.aggs.bucket('per_date', 'terms', field='date')\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
This works fine but returns only 10 results in response.aggregations.per_ts.buckets.
I want all the results.
I have tried one solution with size=0, as mentioned in this question:
search.aggs.bucket('per_ts', 'terms', field='ts', size=0)\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
But this results in an error:
TransportError(400, u'parsing_exception', u'[terms] failed to parse field [size]')
I had the same issue. I finally found this solution:
s = Search(using=client, index="jokes").query("match", jks_content=keywords).extra(size=0)
a = A('terms', field='jks_title.keyword', size=999999)
s.aggs.bucket('by_title', a)
response = s.execute()
After 2.x, size=0 for all bucket results won't work anymore; please refer to this thread. In my example here I just set the size to 999999. You can pick a large number according to your case.
It is recommended to explicitly set a reasonable value for size, a number between 1 and 2147483647.
Hope this helps.
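Applied to the original per_date aggregation, a minimal sketch might look like this (assuming the same search object and percentiles list as in the question; the size of 10000 is an arbitrary upper bound you would tune to your data):
search = search.extra(size=0)  # we only want aggregation buckets, not hits
search.aggs.bucket('per_date', 'terms', field='date', size=10000)\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
for bucket in response.aggregations.per_date.buckets:
    print(bucket.key, bucket.doc_count)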
This is a bit older, but I ran into the same issue. What I wanted was basically an iterator that I could use to go through all the aggregations I got back (I also have a lot of unique results).
The best thing I found is to create a Python generator like this:
def scan_aggregation_results():
    i = 0
    partitions = 20
    while i < partitions:
        s = Search(using=elastic, index='my_index').extra(size=0)
        agg = A('terms', field='my_field.keyword', size=999999,
                include={"partition": i, "num_partitions": partitions})
        s.aggs.bucket('my_agg', agg)
        result = s.execute()
        for item in result.aggregations.my_agg.buckets:
            yield item.key
        i = i + 1

# in other parts of the code just do
for item in scan_aggregation_results():
    print(item)  # or do whatever you want with it
The magic here is that Elasticsearch will automatically partition the results into 20 partitions, i.e. the number of partitions I define. I just have to set the size to something large enough to hold a single partition; in this case the result can be up to 20 million items large (20 * 999999). If you have far fewer items to return, like my 20,000, then you will just have about 1,000 results per query in your bucket, regardless of the much larger size you defined.
Using the generator construct as outlined above you can then even get rid of that limit and create your own scanner, so to speak, iterating over all results individually, which is just what I wanted.
You should read the documentation.
So in your case it should look like this:
search.aggs.bucket('per_date', 'terms', field='date')\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})[0:50]
response = search.execute()

Filtering a set of data based on indices in line

I have a Python script that pulls data from an external server's SQL database and sums the values based on transaction numbers. I've gotten some assistance in cleaning up the result sets, which has been a huge help, but now I've hit another problem.
My original query:
SELECT th.trans_ref_no, th.doc_no, th.folio_yr, th.folio_mo, th.transaction_date,
       tc.prod_id, tc.gr_gals
FROM TransHeader th, TransComponents tc
WHERE th.term_id="%s" and th.source="L" and th.folio_yr="%s" and th.folio_mo="%s"
  and (tc.prod_id="TEXLED" or tc.prod_id="103349" or tc.prod_id="103360"
       or tc.prod_id="103370" or tc.prod_id="113107" or tc.prod_id="113093")
  and th.trans_ref_no=tc.trans_ref_no;
It returns a set of data, a snippet of which I've copied here:
"0520227370","0001063257","2014","01","140101","113107","000002000"
"0520227370","0001063257","2014","01","140101","TEXLED","000002550"
"0520227378","0001063265","2014","01","140101","113107","000001980"
"0520227378","0001063265","2014","01","140101","TEXLED","000002521"
"0520227380","0001063267","2014","01","140101","113107","000001500"
"0520227380","0001063267","2014","01","140101","TEXLED","000001911"
"0520227384","0001063271","2014","01","140101","113107","000003501"
"0520227384","0001063271","2014","01","140101","TEXLED","000004463"
"0520227384","0001063271","2014","01","140101","113107","000004000"
"0520227384","0001063271","2014","01","140101","TEXLED","000005103"
"0520227385","0001063272","2014","01","140101","113107","000007500"
"0520227385","0001063272","2014","01","140101","TEXLED","000009565"
"0520227388","0001063275","2014","01","140101","113107","000002000"
"0520227388","0001063275","2014","01","140101","TEXLED","000002553"
The updated query runs this twice and JOINs on trans_ref_no, which is the first position in the result set, so the first six lines get condensed into three and the last four lines get condensed into two. The problem I'm having is getting transaction number 0520227384 to condense to two lines.
SELECT t1.trans_ref_no, t1.doc_no, t1.folio_yr, t1.folio_mo, t1.transaction_date,
       t1.prod_id, t1.gr_gals, t2.prod_id, t2.gr_gals
FROM (SELECT th.trans_ref_no, th.doc_no, th.folio_yr, th.folio_mo, th.transaction_date,
             tc.prod_id, tc.gr_gals
      FROM Tms6Data.TransHeader th, Tms6Data.TransComponents tc
      WHERE th.term_id="00000MA" and th.source="L" and th.folio_yr="2014" and th.folio_mo="01"
        and (tc.prod_id="103349" or tc.prod_id="103360" or tc.prod_id="103370"
             or tc.prod_id="113107" or tc.prod_id="113093")
        and th.trans_ref_no=tc.trans_ref_no) t1
JOIN (SELECT th.trans_ref_no, th.doc_no, th.folio_yr, th.folio_mo, th.transaction_date,
             tc.prod_id, tc.gr_gals
      FROM Tms6Data.TransHeader th, Tms6Data.TransComponents tc
      WHERE th.term_id="00000MA" and th.source="L" and th.folio_yr="2014" and th.folio_mo="01"
        and tc.prod_id="TEXLED"
        and th.trans_ref_no=tc.trans_ref_no) t2
  ON t1.trans_ref_no = t2.trans_ref_no;
Here is what the new query returns for transaction number 0520227384:
"0520227384","0001063271","2014","01","140101","113107","000003501","TEXLED","000004463"
"0520227384","0001063271","2014","01","140101","113107","000003501","TEXLED","000005103"
"0520227384","0001063271","2014","01","140101","113107","000004000","TEXLED","000004463"
"0520227384","0001063271","2014","01","140101","113107","000004000","TEXLED","000005103"
What I need to get out of this is a set of condensed lines where, in this group, the second and third need to be removed:
"0520227384","0001063271","2014","01","140101","113107","000003501","TEXLED","000004463"
"0520227384","0001063271","2014","01","140101","113107","000004000","TEXLED","000005103"
How can I go about filtering these lines from the updated query result set?
I think the answer is:
(... your heavy sql ..) group by 7
or
(... your heavy sql ..) group by t1.gr_gals
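If the grouping turns out to be awkward to express in SQL, here is a hedged client-side sketch in Python. It assumes rows is the list of 9-field tuples returned by the joined query, and that the duplicated 113107/TEXLED components of a transaction should be paired up in the order they appear:
from itertools import groupby
from operator import itemgetter

def condense(rows):
    # Pair each transaction's distinct t1 components with its distinct t2
    # components positionally, dropping the cross-join duplicates.
    condensed = []
    for trans_ref_no, grp in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
        grp = list(grp)
        left = list(dict.fromkeys((r[5], r[6]) for r in grp))   # t1.prod_id, t1.gr_gals
        right = list(dict.fromkeys((r[7], r[8]) for r in grp))  # t2.prod_id, t2.gr_gals
        for (p1, g1), (p2, g2) in zip(left, right):
            condensed.append(grp[0][:5] + (p1, g1, p2, g2))
    return condensed

For the four 0520227384 rows shown above, this keeps exactly the two desired lines: (113107, 000003501, TEXLED, 000004463) and (113107, 000004000, TEXLED, 000005103).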

Simple example of retrieving 500 items from dynamodb using Python

Looking for a simple example of retrieving 500 items from DynamoDB while minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but I'm not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are two ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key
Use the query method with the hash_key
You may ask to sort the range_keys A-Z or Z-A
2. Items are on "random" keys
You said it: use the BatchGetItem method
Good news: the limit is actually 100 items per request or 1 MB max
You will have to sort the results on the Python side.
On the practical side, since you use Python, I highly recommend the Boto library for low-level access or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. On the contrary, there is a generator for scan and for query which 'pretends' you get everything in a single query.
In order to get optimal results with the batch query, I recommend this workflow:
submit a batch with all of your 500 items.
store the results in your dicts
re-submit with the UnprocessedKeys as many times as needed
sort the results on the python side
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]
    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None
    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)
        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))
        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']
        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)
    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()
keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)
res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)
# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
add all of your requested keys to BatchList
submit()
resubmit() as long as it does not return None.
This should be available in the next release.
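For readers on the current SDK, here is a hedged sketch of the same workflow using boto3 instead of the old boto library (the table name 'MyTable' and a string hash key named 'id' are assumptions for illustration; batch_get_item accepts at most 100 keys per call and reports anything it skipped in UnprocessedKeys):
import boto3

dynamodb = boto3.client('dynamodb')

def batch_get(table_name, keys, chunk_size=100):
    # Fetch items in chunks of 100, re-submitting UnprocessedKeys until done.
    items = []
    for start in range(0, len(keys), chunk_size):
        request = {table_name: {'Keys': keys[start:start + chunk_size]}}
        while request:
            resp = dynamodb.batch_get_item(RequestItems=request)
            items.extend(resp['Responses'].get(table_name, []))
            request = resp.get('UnprocessedKeys') or None
    return items

# Example: 500 items keyed by a string hash key called 'id' (assumed schema)
keys = [{'id': {'S': str(i)}} for i in range(500)]
items = batch_get('MyTable', keys)
items.sort(key=lambda item: item['id']['S'])  # sort on the Python side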
