DynamoDB best solution for making queries without using the primary key - Python

I have a table in DynamoDB similar to this:
StaffID, Name, Email, Office
1514923, Winston Smith, SmithW@company.com, 101
It only has around 100 rows.
I'm experimenting with Amazon's Alexa and the possibility of using it for voice-based queries such as
'Where is Winston Smith?'
The problem is that when using an Alexa function to pull results from the table, it would never be through the primary key StaffID - because you wouldn't have users asking:
'Where is 1514923?'
From what I've read, querying the non-primary key values is extremely slow... Is there a suitable solution to this when using Python with DynamoDB?
I know that with only 100 rows it is negligible - but I'd like to do things in the correct, industry-standard way. Or, in cases like this, is the best solution simply to scan the tables, splitting them up for different user groups when they get too large?

There are two approaches here, depending on your application:
If you only ever want to query this table via the Name field, then change the table so that it has a partition key of Name instead of StaffID. DynamoDB isn't SQL - there's no need to have everything keyed on an ID field unless you actually use it. (Note you can't actually "change" the partition key on an existing DynamoDB table - you'll have to rebuild the table).
If you want to query efficiently via both StaffID and Name, create a global secondary index for the table using the Name field, as sketched below. Be aware that global secondary indexes need both their own provisioned throughput and their own storage, both of which, of course, cost money.
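For the GSI approach, here's a minimal boto3 sketch; the table name 'Staff', the index name 'Name-index' and the throughput values are assumptions for illustration, not taken from the question:

import boto3
from boto3.dynamodb.conditions import Key

client = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')

# One-off: add a global secondary index keyed on Name to the existing table
# (illustrative names; adjust to your own table).
client.update_table(
    TableName='Staff',
    AttributeDefinitions=[{'AttributeName': 'Name', 'AttributeType': 'S'}],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'Name-index',
            'KeySchema': [{'AttributeName': 'Name', 'KeyType': 'HASH'}],
            'Projection': {'ProjectionType': 'ALL'},
            'ProvisionedThroughput': {'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1},
        }
    }],
)

# Later, answer "Where is Winston Smith?" by querying the index instead of StaffID.
table = dynamodb.Table('Staff')
response = table.query(
    IndexName='Name-index',
    KeyConditionExpression=Key('Name').eq('Winston Smith'),
)
print(response['Items'])

If you go with the first approach instead (partition key of Name), the same table.query call works without the IndexName argument.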
Minor aside: this has nothing to do with the fact that you're using the Python interface; it applies to all DynamoDB access.

Related

How can I have a database with thousands of tables with varying numbers of columns that are all of the same class in Django / SQLAlchemy ORM?

I have financial statement data on thousands of different companies. Some of the companies have data only for 2019, but for some I have decade-long data. Each company's financial statement has its own table, structured as follows, with the columns in bold:
lineitem | 2019 | 2018 | 2017
2        | 1000 |  800 |  600
3206     |  700 |  300 | -200
56       |   50 |  100 |  100
200      | 1200 |   90 |  700
This structure is preferred over a flatter structure like lineitem-year-amount, since one query gives me the correct structure of the output for a financial statement table. lineitem is a foreign key linking to the primary key of a mapping table with over 10,000 records; 3206 can, for example, mean "Debt to credit institutions". I also have a companyIndex table which has the company ID, company name, and table name.

I am able to get the data into the database and make queries using sqlite3 in Python, but advanced queries are somewhat of a challenge at times, not to mention that they can take a lot of time and not be very readable. I like the potential of using an ORM in Django or SQLAlchemy. The ORM in SQLAlchemy seems to want me to know the name of the table I am about to create and how many columns to create, but I don't know that, since I have a script that parses a data dump in CSV which includes the company ID and financial statement data for the number of years it has operated. Also, one year later I will have to update the table with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try it out much in practice due to this initial problem, which is a prerequisite for succeeding in my project. I have googled and googled, and checked Stack Overflow for a solution, but not found any solved questions (which is really surprising, since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy, given the structure I plan to have it fit into? How can I have the selected table(s) (based on company ID or company name) be object(s) in the ORM just like any other object, allowing me to select the data I want at the granularity level I want?
Ideally there is a solution to this in Django, but since I haven't found anything I suspect there is not or that how I have structured the database is insanity.
You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again, unless you decide that the schema has to be altered to support different, usually additional, functionality in the application - because, for example, at a certain point in the application's lifetime new attributes become required for the data. Not because there is more data, which simply translates to new data rows in one or more tables.
So first you decide about a proper schema for database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized until 3rd normal form.
You really have to understand this. I haven't read it fully, just skimmed it, but I assume it is correct. This is fundamental database knowledge you cannot escape. After learning it right, and with practice, it becomes second nature and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, or an initial import procedure, or further data import operations, may need to give data some massaging before writing the proper data rows into the existing tables.
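Purely as an illustration of such a normalized schema (the model and field names are my own, not from the question), the Django ORM version might look like this; an extra year of data is then just more rows, never more columns:

from django.db import models

class Company(models.Model):
    name = models.CharField(max_length=255)

class LineItem(models.Model):
    # Mapping table: e.g. code 3206 = "Debt to credit institutions"
    code = models.IntegerField(unique=True)
    description = models.CharField(max_length=255)

class FinancialFact(models.Model):
    company = models.ForeignKey(Company, on_delete=models.CASCADE)
    line_item = models.ForeignKey(LineItem, on_delete=models.CASCADE)
    year = models.IntegerField()
    amount = models.DecimalField(max_digits=18, decimal_places=2)

    class Meta:
        unique_together = ('company', 'line_item', 'year')

# Rebuilding the "one column per year" view is then just a query, e.g.:
# FinancialFact.objects.filter(company__name='SomeCompany').order_by('line_item__code', '-year')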

Storing tweets in DynamoDB

I'm storing tweets in DynamoDB. I'm using the tweet's id property for the hash key and the tweet's created_at property for the range.
I want to query all the tweets in the table to find all tweets after a particular date. I gather that I need to make a GSI (Global Secondary Index) for the timestamp property of the tweet, so that I can query for all tweets after a particular date without needing the tweet's id property. Is this true? And if so, did I do this properly? (I'm confused as to why I need to specify both a hash key and a range key for the GSI.)
So basically you want to create a range index on an attribute in DynamoDB. Tough luck as this is not what the author had in mind. I'll explain.
DynamoDB wants items to be distributed evenly across hashes and to have uniform load. Your twitter_id hash key is definitely helping but is failing you when you want to ask questions about your range keys.
You see, if you want speed, you want Query (uses an index) rather than Scan (no index). Query requires a hash key to query on - you can't query without one.
You are correct that you can't use your original primary key for this, and you are correct to be thinking about a GSI - you can bypass the hash key by creating a GSI that has a constant hash* and the timestamp as its range.
BUT
If you do that you are breaking DynamoDB's performance model by having an index with no distribution. This can cause you headaches at scale and generate bad throughput (you'll pay for more than you'll consume).
I put a star on constant hash* because you can do some manipulation to create several hashes and combine them at the application level.
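A rough boto3 sketch of that pattern, assuming a 'tweets' table with a GSI named 'created_at-index' whose hash key is a constant 'gsi_bucket' attribute written on every item (all of these names are illustrative):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('tweets')

def tweets_after(timestamp, bucket='ALL'):
    # Constant hash + timestamp range key lets you Query instead of Scan.
    response = table.query(
        IndexName='created_at-index',
        KeyConditionExpression=Key('gsi_bucket').eq(bucket) & Key('created_at').gt(timestamp),
    )
    return response['Items']

# Sharded variant of the constant hash: write each tweet to one of several
# buckets ('ALL-0' .. 'ALL-3'), query every bucket, and merge the results in
# application code to avoid a single hot partition.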
To conclude, it is possible to do what you want with Dynamo, but it is not a good fit for Dynamo.

Which way of fetching is more Efficient in Redis?

Hi, I'm fairly new to Redis and currently face a problem. My problem is: I don't know which way gives better performance.
Way #1: Cache all the data in Redis and then query it (I don't know if it is even possible to query Redis; if it is possible, how?).
For example, for the following table, cache all the data under a single key (this way my table maps to 1 key) and then query for users with the same city.
Way #2: Cache all users with the same city under a separate key (this way my table maps to 4 keys) and then fetch each key separately.
Cache all users with the same city under a separate key - that is the Redis way. Fast inserts and fast gets, at the cost of more memory consumption or some data redundancy.
In general you can't follow your Way #1 example. Why not? Redis does not have any out-of-the-box solution for querying data in SQL terms. You can't do something like select something from somewhere where criteria with most Redis data structures. You can write a Lua script for a complex map/reduce solution over your data, but nothing like that comes out of the box.
You should remember that each time you want to "join this data with that data", you can only do it in application space or in a Redis Lua script. Yes, you get some join-like capability with ZSETs and SETs, but it is not what you require.
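A minimal redis-py sketch of Way #2 (assumes a recent redis-py; the 'user:<id>' and 'users:city:<city>' key names are my own illustration):

import redis

r = redis.Redis()

def add_user(user_id, name, city):
    # Store the user record itself as a hash ...
    r.hset('user:%s' % user_id, mapping={'name': name, 'city': city})
    # ... and index it under its city, so "all users in X" is one set lookup.
    r.sadd('users:city:%s' % city.lower(), user_id)

def users_in_city(city):
    ids = r.smembers('users:city:%s' % city.lower())
    return [r.hgetall('user:%s' % uid.decode()) for uid in ids]

add_user(1, 'Alice', 'London')
add_user(2, 'Bob', 'London')
print(users_in_city('London'))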

AWS DynamoDB retrieve entire table

Folks,
Retrieving all items from a DynamoDB table, I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to query and get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of DynamoDB, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (see the sketch after this list).
Export your data using AWS Data Pipelines. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms to pick up new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process the events with AWS Lambda.
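For the parallel scan option above, a boto3 sketch; the table name 'drivers' and the segment count are assumptions for illustration:

import boto3
from concurrent.futures import ThreadPoolExecutor

table = boto3.resource('dynamodb').Table('drivers')
TOTAL_SEGMENTS = 4

def scan_segment(segment):
    # Each worker scans its own logical segment of the table, following pagination.
    items, kwargs = [], {'Segment': segment, 'TotalSegments': TOTAL_SEGMENTS}
    while True:
        response = table.scan(**kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:
            return items
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    all_items = [item for segment in pool.map(scan_segment, range(TOTAL_SEGMENTS)) for item in segment]
print(len(all_items))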
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the composite hash-range primary key of the table).
One cannot use Query without knowing the hash keys.
EDIT as a bounty was added to this old question that asks:
How do I get a list of hashes from DynamoDB?
Well - in Dec 2014 you still can't ask, via a single API call, for all the hash keys of a table.
Even if you go and add a GSI, you still can't get a DISTINCT hash count.
The way I would solve this is with de-normalization. Keep another table with no range key and put every hash key there, alongside the main table. This adds housekeeping overhead to your application level (mainly when removing items), but solves the problem you asked about.
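A sketch of that de-normalization with boto3; the table names and the 'driver_id' hash key are illustrative, not from the question:

import boto3

dynamodb = boto3.resource('dynamodb')
main_table = dynamodb.Table('drivers')       # hash + range key
keys_table = dynamodb.Table('driver_ids')    # hash key only, one item per distinct hash

def put_driver(item):
    # Every write to the main table also records its hash key in the small keys table.
    main_table.put_item(Item=item)
    keys_table.put_item(Item={'driver_id': item['driver_id']})

def all_hash_keys():
    # Scanning the much smaller keys table yields every distinct hash key.
    keys, kwargs = [], {}
    while True:
        response = keys_table.scan(**kwargs)
        keys.extend(i['driver_id'] for i in response['Items'])
        if 'LastEvaluatedKey' not in response:
            return keys
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']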

What are some ways to maintain data consistency at the application layer of NoSQL?

My Python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL stores where index consistency is done at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
Table 1: (user_id) email, employee_id, first name, last name, etc ...
Table 2: (email) user_id
Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data - what is the best way to handle deletions from Table 1 to ensure the matching data gets deleted from Tables 2 and 3? And likewise to ensure inserts stay consistent?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Something like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
The short answer is: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL, in exchange for performance and scalability.
DynamoDB-mapper has a "transaction engine". Transaction objects are plain DynamoDB items and may be persisted. This way, if a logical group of actions (aka a transaction) has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure it has not...
To do a bit of advertisement :), the dynamodb-mapper transaction engine supports:
single/multiple targets
sub transactions
transaction creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main contributors to the dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford to have lazy deletion and insertion as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the index (now stale), but it won't show up on the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some housekeeping daily.
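A minimal sketch of that lazy-cleanup read path with boto3; the table names ('users', 'users_by_email') and key attributes are illustrative, not from the question:

import boto3

dynamodb = boto3.resource('dynamodb')
users = dynamodb.Table('users')                     # Table 1: keyed on user_id
users_by_email = dynamodb.Table('users_by_email')   # Table 2: keyed on email

def get_user_by_email(email):
    index_item = users_by_email.get_item(Key={'email': email}).get('Item')
    if index_item is None:
        return None
    user = users.get_item(Key={'user_id': index_item['user_id']}).get('Item')
    if user is None:
        # Stale index entry: the main row is gone, so clean it up lazily and report a miss.
        users_by_email.delete_item(Key={'email': email})
        return None
    return user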
B. If you really want to simulate locks and transactions, you could consider using something like Apache Zookeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.
