AWS DynamoDB retrieve entire table - python

Folks,
I am retrieving all items from a DynamoDB table and would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
from boto.dynamodb2.table import Table

drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!

There is no way to query and get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. This can be done faster, at the expense of using much more read capacity, with a parallel scan (see the boto3 sketch after this list).
Export your data using AWS Data Pipeline. You can configure the export job to control where and how it stores your data.
Use one of the AWS event platforms to capture new data and denormalize it. For all new data you can get a time-ordered stream of every update to the table from DynamoDB Streams, or process the events using AWS Lambda.
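For reference, a minimal parallel-scan sketch in boto3; the table name is hypothetical, and each worker scans one segment while following pagination:

import boto3
from concurrent.futures import ThreadPoolExecutor

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('drivers')  # hypothetical table name

def scan_segment(segment, total_segments):
    # Scan one segment of the table, following LastEvaluatedKey pagination.
    items = []
    kwargs = {'Segment': segment, 'TotalSegments': total_segments}
    while True:
        response = table.scan(**kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:
            break
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
    return items

# Each worker reads 1/N of the table in parallel.
total_segments = 4
with ThreadPoolExecutor(max_workers=total_segments) as pool:
    results = pool.map(scan_segment, range(total_segments),
                       [total_segments] * total_segments)
all_items = [item for segment in results for item in segment]

The segment count trades scan speed against read-capacity consumption, so pick it to match the throughput you can afford.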

You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the composite hash-range primary key of the table).
One cannot use query without knowing the hash keys.
EDIT: as a bounty was added to this old question that asks:
How do I get a list of hashes from DynamoDB?
Well, as of Dec 2014 you still can't ask a single API call for all hash keys of a table.
Even if you add a GSI you still can't get a DISTINCT hash count.
The way I would solve this is with de-normalization. Keep another table with no range key and put every hash key there alongside the main table. This adds house-keeping overhead at your application level (mainly when removing items), but solves the problem you asked about.
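A minimal sketch of that de-normalization in boto3; the table and attribute names are hypothetical. Every write to the main table also mirrors the hash key into a keys-only table:

import boto3

dynamodb = boto3.resource('dynamodb')
main_table = dynamodb.Table('drivers')       # hypothetical main table
keys_table = dynamodb.Table('driver_keys')   # hypothetical keys-only table

def put_with_key_mirror(item):
    # Write the full item, then record its hash key in the keys-only table.
    main_table.put_item(Item=item)
    keys_table.put_item(Item={'hash_key': item['hash_key']})

def delete_with_key_mirror(hash_key, range_key):
    # House-keeping on removal: ideally drop the mirror entry only when the
    # last item under this hash key is gone (simplified here to always drop).
    main_table.delete_item(Key={'hash_key': hash_key, 'range_key': range_key})
    keys_table.delete_item(Key={'hash_key': hash_key})

Listing all hash keys then becomes a cheap scan of the small keys-only table.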

Related

Update SQL database records based on JSON

I have a table with 30k clients, with the ClientID as primary key.
I'm getting data from API calls and inserting them into the table using python.
I'd like to find a way to insert rows for new clients and, if the ClientID that comes with the API call already exists in the table, update the existing record with the new information for that client.
Thanks!!
A snippet of code would be nice to show us what exactly you are doing right now. I presume you are using an ORM like SQLAlchemy? If so, then you are looking at doing an UPSERT type of operation.
That is already answered HERE
Alternatively, if you are executing raw queries without an ORM, then you could write a custom procedure and pass the required parameters. HERE is a good write-up on how that is done in MSSQL under high concurrency. You could use it as a starting point for understanding and then re-write it for PostgreSQL.
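If you land on PostgreSQL, a minimal upsert sketch using psycopg2 and INSERT ... ON CONFLICT (PostgreSQL 9.5+); the table and column names are hypothetical:

import psycopg2

conn = psycopg2.connect('dbname=mydb')  # hypothetical connection string

def upsert_client(client):
    # Insert the row, or update it in place if the ClientID already exists.
    with conn, conn.cursor() as cur:
        cur.execute(
            'INSERT INTO clients (client_id, name, email) '
            'VALUES (%(client_id)s, %(name)s, %(email)s) '
            'ON CONFLICT (client_id) '
            'DO UPDATE SET name = EXCLUDED.name, email = EXCLUDED.email',
            client,
        )

upsert_client({'client_id': 42, 'name': 'Jane Doe', 'email': 'jane@example.com'})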

Is there a way to check if a key=value exists in a DynamoDB table?

I want to check if a specific key has a specific value in a dynamodb table with/without retrieving the entire item. Is there a way to do this in Python using boto3?
Note: I am looking to match a sort key with its value and check whether that specific key-value pair exists in the table.
It sounds like you want to fetch an item by its sort key alone. While this is possible with the scan operation, it's not ideal.
DynamoDB gives us three ways to fetch data: getItem, query and scan.
The getItem operation allows you to fetch a single item using its primary key. The query operation can fetch multiple items within the same partition, but requires you to specify the partition key (and optionally the sort key). The scan operation lets you fetch items by specifying any attribute.
Therefore, if you want to fetch data from DynamoDB without using the full primary key or partition key, you can use the scan operation. However, be careful when using scan. From the docs:
The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index.
The scan operation can be horribly inefficient if not used carefully. If you find yourself using scans frequently in your application or in a highly trafficked area of your app, you probably want to reorganize your data model.
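A minimal scan sketch in boto3; the table and sort-key attribute names are hypothetical. Note that scan still reads every item, and the filter is applied afterwards:

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # hypothetical table name

# Filter on the sort-key attribute; a real check should also follow
# LastEvaluatedKey pagination, since a filter can empty a page of results.
response = table.scan(FilterExpression=Attr('my_sort_key').eq('some-value'))
exists = len(response['Items']) > 0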
What Seth said is 100% accurate; however, if you can add a GSI you can use the query operation on the GSI. You could create a GSI keyed on the value of the sort key, allowing you to query for records that match that sort key. You can even use the same field, and if you don't need any of the data you can project just the keys, keeping the cost relatively low.
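A sketch of that GSI approach in boto3; the index and attribute names are hypothetical:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # hypothetical table name

# Query the GSI whose partition key is the base table's sort-key attribute.
response = table.query(
    IndexName='sort-key-index',  # hypothetical GSI name
    KeyConditionExpression=Key('my_sort_key').eq('some-value'),
    Select='COUNT',  # we only need existence, not the items themselves
)
exists = response['Count'] > 0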

dynamodb update denormalized data and keep consistency

I am using DynamoDB with the Python API and denormalize my data in order to keep reads fast. The thing is that I am worried about keeping consistency when updating my data. Say I have a table of users, each with a key and a name, and a table of purchases, each with a key and data containing the buyer's key (user) and the buyer's name.
I would like to update a user's name and update all of his purchases in one atomic operation, like the multi-path update available in Firebase, explained here.
How can I do that?
Thanks
Here is some nice documentation of DynamoDB transactions.
Here are a few highlights from the blog post.
DynamoDB supports transactions across multiple tables, and you can also attach a pre-condition to every insert (e.g. insert into the order table only if prev_snapshot=1223232; this makes sure you are modifying only the data you last read).
Two types of gets are supported: TransactGetItems and the eventually/strongly consistent GetItem. With TransactGetItems, if a transaction is in progress the request is rejected, while in the other two cases the last committed data is returned, based on your consistency requirements.
Transactions are not locks: if some other thread writes to a table without a transaction, and that write succeeds before the transaction completes, an exception will be thrown on the transaction.
No extra steps or permissions are required to enable transactions on a single-region table.
Cost will double for every read and write while using transactional capabilities.
Here are the features which are not supported:
Transactional capabilities in global tables, but this can be worked around with request stickiness and should not be a big issue IMO.
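A minimal sketch of the questioner's scenario with TransactWriteItems via boto3; the table, key, and attribute names are hypothetical, and a single transaction is limited to a fixed number of items:

import boto3

client = boto3.client('dynamodb')

# Update the user's name and the denormalized copy on one purchase in a
# single all-or-nothing transaction.
client.transact_write_items(
    TransactItems=[
        {
            'Update': {
                'TableName': 'users',
                'Key': {'user_id': {'S': 'u-123'}},
                'UpdateExpression': 'SET #n = :name',
                'ExpressionAttributeNames': {'#n': 'name'},  # name is reserved
                'ExpressionAttributeValues': {':name': {'S': 'New Name'}},
            }
        },
        {
            'Update': {
                'TableName': 'purchases',
                'Key': {'purchase_id': {'S': 'p-456'}},
                'UpdateExpression': 'SET buyer_name = :name',
                'ExpressionAttributeValues': {':name': {'S': 'New Name'}},
            }
        },
    ]
)

In practice you would first query the user's purchases and build one Update entry per purchase, splitting into multiple transactions if the item limit is exceeded.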

DynamoDB best solution to making queries without using the primary key

I have a table in DyamoDB similar to this:
StaffID    Name            Email                 Office
1514923    Winston Smith   SmithW@company.com    101
It only has around 100 rows.
I'm experimenting with Amazon's Alexa and the possibility of using it for voice-based queries such as
'Where is Winston Smith?'
The problem is that when using an Alexa function to pull results from the table, it would never be through the primary key StaffID - because you wouldn't have users asking:
'Where is 1514923?'
From what I've read, querying the non-primary key values is extremely slow... Is there a suitable solution to this when using Python with DynamoDB?
I know that with only 100 rows it is negligible - but I'd like to do things in the correct, industry standard way. Or is the best solution with cases like this, to simply scan the tables - splitting them up for different user groups when they get too large?
There are two approaches here, depending on your application:
If you only ever want to query this table via the Name field, then change the table so that it has a partition key of Name instead of StaffID. DynamoDB isn't SQL - there's no need to have everything keyed on an ID field unless you actually use it. (Note you can't actually "change" the partition key on an existing DynamoDB table - you'll have to rebuild the table).
If you want to query efficiently via both StaffID and Name, create a global secondary index for the table using the Name field. Be aware that global secondary indexes need both their own provisioned throughput and storage, both of which of course equal money.
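A sketch of the second approach in boto3, assuming a hypothetical GSI named Name-index on the Name field:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
staff_table = dynamodb.Table('staff')  # hypothetical table name

# Query the GSI by Name instead of scanning the whole table.
response = staff_table.query(
    IndexName='Name-index',  # hypothetical GSI name
    KeyConditionExpression=Key('Name').eq('Winston Smith'),
)
for item in response['Items']:
    print(item['Office'])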
Minor aside: this is nothing to do with the fact you're using the Python interface, it applies to all DynamoDB access.

How do I delete items from a DynamoDB table wherever an attribute is missing, regardless of key?

Is it possible to delete items from a DynamoDB table without specifying partition or sort keys? I have numerous entries in a table with different partition and sort keys and I want to delete all the items where a certain attribute does not exist.
AWS CLI or boto3/python solutions are welcome.
To delete a large number of items from the table you need to query or scan first, and then delete the items using the BatchWriteItem or DeleteItem operation.
Query plus BatchWriteItem is better in terms of performance and cost, so if this is a job that happens frequently, it's better to add a global secondary index on the attribute you need to check for deletion. However, you need to run BatchWriteItem iteratively for a large number of items, since query will return paginated results.
Otherwise you can do a scan and call DeleteItem iteratively.
Check this Stack Overflow question for more insight.
It's worth trying the EMR Hive integration with DynamoDB. It allows you to write SQL queries against a DynamoDB table. Hive supports the DELETE statement, and Amazon has implemented a DynamoDB connector. I am not sure how smoothly this integrates, but it's worth a try. Here is how to work with DynamoDB using EMR Hive.
Another option is to use a parallel scan. Just get all items from DynamoDB that match a filter expression, and delete each one of them. Here is how to do scans using the boto client.
To speed up the process you can batch delete items using the BatchWriteItem method. Here is how to do this in boto.
Notice that BatchWriteItem has the following limitation:
BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests.
Keep in mind that scans are expensive: when you scan, you consume RCUs for every item DynamoDB reads in your table, not just for the items it returns. So you either need to read the data slowly or provision very high RCU for the table.
It's OK to do this operation infrequently, but you can't do it as part of a web-server request if you have a table of any decent size.
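Putting the scan-and-batch-delete approach together, a minimal boto3 sketch; the table, key, and attribute names are hypothetical, and batch_writer handles the 25-request chunking and retries for you:

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # hypothetical table name

# Scan for items missing the attribute, following pagination, and delete
# them in batches. Project only the key attributes to keep responses small.
scan_kwargs = {
    'FilterExpression': Attr('my_attribute').not_exists(),
    'ProjectionExpression': 'pk, sk',  # hypothetical key attribute names
}
with table.batch_writer() as batch:
    while True:
        response = table.scan(**scan_kwargs)
        for item in response['Items']:
            batch.delete_item(Key={'pk': item['pk'], 'sk': item['sk']})
        if 'LastEvaluatedKey' not in response:
            break
        scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']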
