We have an application where the client needs to request some information stored in DynamoDB, filtered by date. We have the service built with API Gateway and a Lambda retrieving the corresponding records from DynamoDB, so we have to retrieve all the necessary records in under 30 seconds.
The volume of records keeps increasing, and we have thought of the following:
The client will ask for records by position range (0-100, 100-200, 200-300, etc.) in order to display them on a specific page on the frontend.
The backend will handle requests (and therefore search DynamoDB) for that specific range of records (0-100, 100-200, etc.)
Is there any way in DynamoDB to get the records from one specific position to another? Or is the only way to retrieve all the records for that date range and then send only the requested positions to the client?
Thank you in advance,
Best regards.
You don’t specify a schema so I’m going to give you one. :)
Set up a sort key that's the position number. Then you can efficiently retrieve by position number range.
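For example, with boto3 the query could look something like this (a minimal sketch; the table name "records", partition key "date" and sort key "position" are made up for illustration):

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: partition key "date", numeric sort key "position"
# assigned at write time.
table = boto3.resource("dynamodb").Table("records")

def get_page(date, start, end):
    # One Query fetches exactly the positions [start, end] for that date.
    resp = table.query(
        KeyConditionExpression=Key("date").eq(date) & Key("position").between(start, end)
    )
    return resp["Items"]

page = get_page("2019-06-01", 100, 199)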
Or, if you want to use timestamps instead of ordinals, just pass the client the sort key where its next request should start and use it as the lower bound for the next query.
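A rough sketch of that hand-off, again with made-up names ("date" partition key, "created_at" sort key) and assuming timestamps are unique within a date:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("records")  # hypothetical table

def get_next_page(date, after_timestamp=None, page_size=100):
    cond = Key("date").eq(date)
    if after_timestamp:
        # Lower bound: everything strictly after the last sort key the
        # client already has (assumes timestamps are unique per date).
        cond = cond & Key("created_at").gt(after_timestamp)
    resp = table.query(KeyConditionExpression=cond, Limit=page_size)
    items = resp["Items"]
    # The client sends back the last item's sort key as the start of its next request.
    next_cursor = items[-1]["created_at"] if items else None
    return items, next_cursor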
There's no way to efficiently find the Nth item in an item collection.
I want to check if a specific key has a specific value in a DynamoDB table with/without retrieving the entire item. Is there a way to do this in Python using boto3?
Note: I am looking to match a sort key with its value and check if that specific key-value pair exists in the table.
It sounds like you want to fetch an item by its sort key alone. While this is possible with the scan operation, it's not ideal.
DynamoDB gives us three ways to fetch data: getItem, query and scan.
The getItem operation allows you to fetch a single item using its primary key. The query operation can fetch multiple items within the same partition, but requires you to specify the partition key (and optionally the sort key). The scan operation lets you fetch items by specifying any attribute.
Therefore, if you want to fetch data from DynamoDB without using the full primary key or partition key, you can use the scan operation. However, be careful when using scan. From the docs:
The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index.
The scan operation can be horribly inefficient if not used carefully. If you find yourself using scans frequently in your application or in a highly trafficked area of your app, you probably want to reorganize your data model.
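If you do go the scan route anyway, a minimal boto3 sketch might look like this (table and attribute names are made up):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table

# Filter on the sort key attribute "sk"; note that DynamoDB still reads
# (and bills for) every item, and a complete check would also have to
# follow LastEvaluatedKey across result pages.
resp = table.scan(FilterExpression=Attr("sk").eq("some-value"))
exists = resp["Count"] > 0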
What Seth said is 100% accurate; however, if you can add a GSI you can use the query operation on the GSI. You could create a GSI that is just the value of the sort key, allowing you to query for records that match that sort key. You can even use the same field, and if you don't need any of the data you can just project the keys, keeping the cost relatively low.
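Roughly, assuming a keys-only GSI named "sk-index" whose partition key is the table's sort key attribute "sk" (all names here are made up):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table

resp = table.query(
    IndexName="sk-index",
    KeyConditionExpression=Key("sk").eq("some-value"),
    Limit=1,  # one match is enough to know the key/value pair exists
)
exists = resp["Count"] > 0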
Using the Python SDK, I could not find how to get all the keys from one bucket in Couchbase.
Docs reference:
http://docs.couchbase.com/sdk-api/couchbase-python-client-2.2.0/api/couchbase.html#item-api-methods
https://github.com/couchbase/couchbase-python-client/tree/master/examples
https://stackoverflow.com/questions/27040667/how-to-get-all-keys-from-couchbase
Is there a simple way to get all the keys?
I'm a little concerned as to why you would want every single key. The number of documents can get very large, and I can't think of a good reason to want every single key.
That being said, here are a couple of ways to do it in Couchbase:
N1QL. First, create a primary index (CREATE PRIMARY INDEX ON bucketname), then select the keys: SELECT META().id FROM bucketname; In Python, you can use N1QLQuery and N1QLRequest to execute these.
Create a map/reduce view index. Literally the default map function when you create a new map/reduce view index is exactly that: function (doc, meta) { emit(meta.id, null); }. In Python, use the View class.
You don't need Python to do these things, by the way, but you can use it if you'd like. Check out the documentation for the Couchbase Python SDK for more information.
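As a rough illustration of the N1QL option with the 2.x Python SDK (the connection string and bucket name are just examples):

from couchbase.bucket import Bucket
from couchbase.n1ql import N1QLQuery

bucket = Bucket("couchbase://localhost/bucketname")  # example connection string

# Requires the primary index: CREATE PRIMARY INDEX ON bucketname
for row in bucket.n1ql_query(N1QLQuery("SELECT META().id AS id FROM bucketname")):
    print(row["id"])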
I'm a little concerned as to why you would want every single key. The number of documents can get very large, and I can't think of a good reason to want every single key.
There is a document for every customer with the key being the username for the customer. That username is only held as a one-way hash (along with the password) for authentication. It is not stored in its original form or in a form from which the original can be recovered. It's not feasible to ask the 100 million customers to provide their userids. This came from an actual customer on #seteam.
Update:
To give more detail on the problem: put_records calls are charged based on the number of records (partition keys) submitted and the size of those records. Any record smaller than 25KB is charged as one PU (Payload Unit). Our records for an individual UID average about 100 bytes per second. If we put them individually we will spend a couple of orders of magnitude more money on PUs than we need to.
Regardless of the solution we want a given UID to always end up in the same shard to simplify the work on the other end of Kinesis. This happens naturally if the UID is used as the partition key.
One way to deal with this would be to continue to do puts for each UID but buffer them over time. However, to use PUs efficiently we'd end up introducing a delay of 250 seconds into the stream.
The combination of the answer given here and this question gives me a strategy for mapping multiple user IDs to static (predetermined) partition keys for each shard.
This would allow multiple UIDs to be batched into one Payload Unit (using the shared partition key for the target shard) so they can be written out as they come in each second while ensuring a given UID ends up in the correct shard.
Then I just need a buffer for each shard; as soon as the buffered records total just under 25KB, or 500 records are reached (the max per put_records call), the data can be pushed.
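Something like this is what I have in mind (a sketch only; the stream name and the precomputed shard-to-key map are placeholders):

import json
import boto3

kinesis = boto3.client("kinesis")

# Placeholder: precomputed map of shard id -> a partition key known to land
# on that shard (figuring out this mapping is the remaining problem below).
SHARD_KEYS = {"shardId-000000000000": "static-key-0"}

FLUSH_BYTES = 24 * 1024   # stay safely under the 25KB payload unit
FLUSH_COUNT = 500         # put_records accepts at most 500 entries per call

buffers = {shard_id: [] for shard_id in SHARD_KEYS}

def add_record(shard_id, record):
    buf = buffers[shard_id]
    buf.append(record)
    if len(buf) >= FLUSH_COUNT or len(json.dumps(buf)) >= FLUSH_BYTES:
        flush(shard_id)

def flush(shard_id):
    buf = buffers[shard_id]
    if not buf:
        return
    # All buffered UID records travel as one aggregated blob (one payload
    # unit) under the shared partition key for the target shard.
    kinesis.put_records(
        StreamName="my-stream",  # placeholder stream name
        Records=[{
            "Data": json.dumps(buf).encode("utf-8"),
            "PartitionKey": SHARD_KEYS[shard_id],
        }],
    )
    buffers[shard_id] = []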
That just leaves figuring out ahead of time which shard a given UID would naturally map to if it was used as a partition key.
The AWS Kinesis documentation says this is the method:
Partition keys are Unicode strings with a maximum length limit of 256 bytes. An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards.
Unless someone has done this before I'll try and see if the method in this question generates valid mappings. I'm wondering if I need to convert a regular Python string into a unicode string before doing the MD5.
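Here is roughly what I plan to try with boto3, explicitly encoding the key as UTF-8 before hashing (the stream name is a placeholder, and for simplicity this assumes the stream has not been resharded, i.e. every listed shard is still open):

import hashlib
import boto3

kinesis = boto3.client("kinesis")

def shard_for_key(stream_name, partition_key):
    # MD5 the UTF-8 bytes of the key, as the Kinesis docs describe.
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    shards = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"]
    for shard in shards:
        hk = shard["HashKeyRange"]
        if int(hk["StartingHashKey"]) <= key_hash <= int(hk["EndingHashKey"]):
            return shard["ShardId"]

print(shard_for_key("my-stream", u"some-uid"))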
There are probably other solutions, but this should work and I'll accept the existing answer here if no challenger appears.
Excerpt from a previous answer:
Try generating a few random partition_keys, and send distinct value with them to the stream.
Run a consumer application and see which shard delivered which value.
Then map the partition keys which you used to send each record with the corresponding shard.
So, now that you know which partition key to use while sending data to a specific shard, you can use this map while sending those special "to be multiplexed" records...
It's hacky and brute force, but it will work.
Also see previous answer regarding partition keys and shards:
https://stackoverflow.com/a/31377477/1622134
Hope this helps.
PS: If you use the low-level Kinesis APIs and create a custom PutRecord request, in the response you can find which shard the data is placed upon. PutRecordResponse contains ShardId information:
http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html
Source: https://stackoverflow.com/a/34901425/1622134
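For what it's worth, the boto3 equivalent is a quick probe (the stream name and key are placeholders):

import boto3

kinesis = boto3.client("kinesis")
resp = kinesis.put_record(
    StreamName="my-stream",        # placeholder stream name
    Data=b"probe",
    PartitionKey="candidate-key-1",
)
print(resp["ShardId"], resp["SequenceNumber"])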
I'm storing tweets in DynamoDB. I'm using the tweet's id property for the hash key and the tweet's created_at property for the range.
I want to query on all the tweets in the table to find all tweets after a particular date. I gather that I need to make a GSI (Global Secondary Index) for the timestamp property of the tweet, so that I can query for all tweets after a particular date without needing the tweet's id property. Is this true? And if so, did I do this properly: (I'm confused as to why I need to specify a hash key and a range key for the GSI?)
So basically you want to create a range index on an attribute in DynamoDB. Tough luck as this is not what the author had in mind. I'll explain.
DynamoDB wants items to be distributed evenly across hashes and to have uniform load. Your twitter_id hash key is definitely helping but is failing you when you want to ask questions about your range keys.
You see, if you want speed you want to use Query rather than Scan (Query = index, Scan = no index). Query requires a hash key to query on - you can't query without one.
You are correct that you can't use your original primary key for this and you are correct thinking about a GSI - you can bypass the hash key by creating a GSI that will have a constant hash* and timestamp as range.
BUT
If you do that you are breaking DynamoDB's performance by having an index with no distribution. This can cause you headaches at scale and generate bad throughput (you'll pay for more than you'll consume).
I put a star on constant hash* because you can do some manipulations to create several hashes and combine them at the application level.
To conclude, it is possible to do what you want with Dynamo, but it is not a good fit for Dynamo.
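If you accept the trade-off, a sketch of the constant-hash GSI query could look like this (index and attribute names are made up; "gsi_hash" would hold the same constant value, e.g. "tweet", on every item):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("tweets")  # hypothetical table

resp = table.query(
    IndexName="created_at-index",  # hypothetical GSI: hash "gsi_hash", range "created_at"
    KeyConditionExpression=Key("gsi_hash").eq("tweet")
    & Key("created_at").gt("2015-01-01T00:00:00Z"),
)
tweets_after = resp["Items"]

Just keep the caveat above in mind: with a single constant hash, every item of the index lands in the same partition.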
Folks,
I am retrieving all items from a DynamoDB table and would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
from boto.dynamodb2.table import Table

drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()  # full table scan
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to query and get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (see the sketch after this list).
Export your data using AWS Data Pipeline. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms for new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process the events using AWS Lambda.
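A rough sketch of the parallel scan option with boto3 (the table name and segment count are made up; in practice each segment would run in its own worker thread or process):

import boto3

table = boto3.resource("dynamodb").Table("drivers")  # hypothetical table
TOTAL_SEGMENTS = 4

def scan_segment(segment):
    # Each segment reads a disjoint slice of the table.
    items, kwargs = [], {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            return items
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

all_items = [item for seg in range(TOTAL_SEGMENTS) for item in scan_segment(seg)]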
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the composite hash-range primary key of the table).
One cannot use query without knowing the hash keys.
EDIT: a bounty was added to this old question, which asks:
How do I get a list of hashes from DynamoDB?
Well - as of Dec 2014 you still can't ask, via a single API call, for all the hash keys of a table.
Even if you go and put a GSI you still can't get a DISTINCT hash count.
The way I would solve this is with de-normalization. Keep another table with no range key and put every hash key there alongside each write to the main table. This adds housekeeping overhead at the application level (mainly when removing items), but solves the problem you asked about.
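A sketch of what that de-normalization could look like, with made-up table names (the companion table has only the hash key attribute):

import boto3

dynamodb = boto3.resource("dynamodb")
main = dynamodb.Table("main-table")           # hypothetical main table
hashes = dynamodb.Table("main-table-hashes")  # hypothetical keys-only companion

def put_item(hash_key, range_key, attrs):
    main.put_item(Item={"hash_key": hash_key, "range_key": range_key, **attrs})
    # Keep the companion table in sync; a duplicate put just overwrites.
    hashes.put_item(Item={"hash_key": hash_key})

# Listing all distinct hash keys then becomes a scan of the small companion table.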