I'm storing tweets in DynamoDB. I'm using the tweet's id property for the hash key and the tweet's created_at property for the range.
I want to query on all the tweets in the table to find all tweets after a particular date. I gather that I need to make a GSI (Global Secondary Index) for the timestamp property of the tweet, so that I can query for all tweets after a particular date without needing the tweet's id property. Is this true? And if so, did I do this properly: (I'm confused as to why I need to specify a hash key and a range key for the GSI?)
So basically you want to create a range index on a single attribute in DynamoDB. Tough luck, as this is not what DynamoDB's designers had in mind. I'll explain.
DynamoDB wants items to be distributed evenly across hashes and to have uniform load. Your twitter_id hash key is definitely helping but is failing you when you want to ask questions about your range keys.
You see, if you want speed you want Query (which uses an index) rather than Scan (which does not). But Query requires a hash key to query on - you can't query without one.
You are correct that you can't use your original primary key for this, and you are correct to think about a GSI - you can bypass the hash key by creating a GSI that has a constant hash* and the timestamp as its range key.
BUT
If you do that you are breaking DynamoDB's performance model by creating an index with no distribution. This can cause you headaches at scale and produce poor throughput (you'll pay for more than you'll consume).
I put a star on constant hash* because you can do some manipulation to spread the constant across several hash values and combine the results at the application level, as sketched below.
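For illustration, here is a minimal boto3 sketch of that sharding trick. The names are assumptions, not from the question: a tweets table with a GSI called timestamp-index whose hash key is a synthetic gsi_shard attribute and whose range key is created_at.

    import random
    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table "tweets" with GSI "timestamp-index"
    # (hash key: gsi_shard, range key: created_at).
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("tweets")
    NUM_SHARDS = 10

    def put_tweet(tweet_id, created_at, text):
        # Spread writes across a handful of shard values so the GSI
        # hash key is not a single constant.
        table.put_item(Item={
            "id": tweet_id,
            "created_at": created_at,
            "text": text,
            "gsi_shard": random.randint(0, NUM_SHARDS - 1),
        })

    def tweets_after(timestamp):
        # Query every shard and merge the results in the application.
        items = []
        for shard in range(NUM_SHARDS):
            resp = table.query(
                IndexName="timestamp-index",
                KeyConditionExpression=Key("gsi_shard").eq(shard)
                & Key("created_at").gt(timestamp),
            )
            items.extend(resp["Items"])
        return sorted(items, key=lambda i: i["created_at"])

The trade-off is that every read fans out to all shard values, so keep the shard count small.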
To conclude, it is possible to do what you want with DynamoDB, but it is not a natural fit for it.
We have an application where the client needs to request some information stored in DynamoDB, filtered by date. We have the service built with API Gateway and a Lambda retrieving the corresponding records from DynamoDB, so we have to retrieve all the necessary records in under 30 seconds.
The volume of records keeps increasing, and we have thought of the following:
The client will ask for records in a specific range (0-100, 100-200, 200-300, etc.) in order to display them on a specific page on the frontend.
The backend will handle requests (and therefore search DynamoDB) for that specific range of records (0-100, 100-200, etc.)
Is there any way in DynamoDB to get the records from one specific position to another? Or is the only way to retrieve all the records for that date range and then send the requested positions to the client?
Thank you in advance,
Best regards.
You don’t specify a schema so I’m going to give you one. :)
Set up a sort key that's the position number. Then you can efficiently retrieve by position number range.
Or, if you want to use timestamps instead of ordinals, just pass the client the sort key value where their next request should start, and use it as the lower bound for the next query.
There's no way to efficiently find the Nth item in an item collection.
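As a sketch of the position-number approach (the table and attribute names below are assumptions, not from the question), one page becomes a single Query on the sort key range:

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical schema: partition key "date" (e.g. "2023-05-01") and a
    # numeric sort key "position" assigned when the record is written.
    table = boto3.resource("dynamodb").Table("records")

    def get_page(date, start_position, page_size=100):
        # Fetch records start_position .. start_position + page_size - 1
        # for one date, using the sort key range instead of scanning.
        resp = table.query(
            KeyConditionExpression=Key("date").eq(date)
            & Key("position").between(start_position,
                                      start_position + page_size - 1),
        )
        return resp["Items"]

For the timestamp variant, you would instead return the last created_at value of the page to the client and use Key("created_at").gt(last_seen) as the condition for the next request.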
I want to check if a specific key has a specific value in a dynamodb table with/without retrieving the entire item. Is there a way to do this in Python using boto3?
Note: I am looking to match a sort key with its value and check if that specific key-value pair exists in the table.
It sounds like you want to fetch an item by its sort key alone. While this is possible with the scan operation, it's not ideal.
DynamoDB gives us three ways to fetch data: getItem, query and scan.
The getItem operation allows you to fetch a single item using its primary key. The query operation can fetch multiple items within the same partition, but requires you to specify the partition key (and optionally the sort key). The scan operation lets you fetch items by specifying any attribute.
Therefore, if you want to fetch data from DynamoDB without using the full primary key or partition key, you can use the scan operation. However, be careful when using scan. From the docs:
The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index.
The scan operation can be horribly inefficient if not used carefully. If you find yourself using scans frequently in your application or in a highly trafficked area of your app, you probably want to reorganize your data model.
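To make that trade-off concrete, here is a rough boto3 sketch of a scan filtered on a sort-key value; the table and attribute names are made up for illustration:

    import boto3
    from boto3.dynamodb.conditions import Attr

    # Hypothetical table "my-table" whose sort key attribute is "sk".
    table = boto3.resource("dynamodb").Table("my-table")

    def sort_key_value_exists(value):
        # Scan reads (and bills for) every item, then applies the filter;
        # pagination is needed because each page is capped at 1 MB of data.
        kwargs = {"FilterExpression": Attr("sk").eq(value)}
        while True:
            resp = table.scan(**kwargs)
            if resp["Items"]:
                return True
            if "LastEvaluatedKey" not in resp:
                return False
            kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]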
What Seth said is 100% accurate; however, if you can add a GSI, you can use the query operation on the GSI. You could create a GSI whose key is just the value of the sort key, allowing you to query for records that match that sort key. You can even use the same field, and if you don't need any of the data you can project only the keys, keeping the cost relatively low.
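A sketch of that GSI approach, assuming a hypothetical index named sk-index keyed on the sort-key attribute sk with a keys-only projection:

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical GSI "sk-index" whose partition key is the base table's
    # sort key attribute "sk"; a KEYS_ONLY projection keeps storage cost low.
    table = boto3.resource("dynamodb").Table("my-table")

    def sort_key_value_exists(value):
        resp = table.query(
            IndexName="sk-index",
            KeyConditionExpression=Key("sk").eq(value),
            Limit=1,
        )
        return resp["Count"] > 0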
I have a table in DynamoDB similar to this:
StaffID, Name, Email, Office
1514923, Winston Smith, SmithW@company.com, 101
It only has around 100 rows.
I'm experimenting with Amazon's Alexa and the possibility of using it for voice-based queries such as
'Where is Winston Smith?'
The problem is that when using an Alexa function to pull results from the table, it would never be through the primary key StaffID - because you wouldn't have users asking:
'Where is 1514923?'
From what I've read, querying the non-primary key values is extremely slow... Is there a suitable solution to this when using Python with DynamoDB?
I know that with only 100 rows it is negligible - but I'd like to do things in the correct, industry standard way. Or is the best solution with cases like this, to simply scan the tables - splitting them up for different user groups when they get too large?
There are two approaches here, depending on your application:
If you only ever want to query this table via the Name field, then change the table so that it has a partition key of Name instead of StaffID. DynamoDB isn't SQL - there's no need to have everything keyed on an ID field unless you actually use it. (Note you can't actually "change" the partition key on an existing DynamoDB table - you'll have to rebuild the table).
If you want to query efficiently via both StaffID and Name, create a global secondary index for the table using the Name field. Be aware that global secondary indexes need both their own provisioned throughput and storage, both of which of course equal money.
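For instance (the table and index names here are assumptions, not from the question), querying such an index with boto3 might look like this:

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table "Staff" with a GSI "Name-index" whose partition key
    # is Name (the index itself can be added in the console or via update_table).
    table = boto3.resource("dynamodb").Table("Staff")

    resp = table.query(
        IndexName="Name-index",
        KeyConditionExpression=Key("Name").eq("Winston Smith"),
    )
    for item in resp["Items"]:
        print(item["StaffID"], item["Office"])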
Minor aside: this is nothing to do with the fact you're using the Python interface, it applies to all DynamoDB access.
I am currently working on a list implementation in Python that stores a persistent list in a database:
https://github.com/DarkShroom/sqlitelist
I am tackling a design consideration: it seems that SQLite allows me to store the data without a primary key?
self.c.execute('SELECT * FROM unnamed LIMIT 1 OFFSET {}'.format(key))
This line of code retrieves a row by its absolute position.
Is this bad practice? Will I lose the data order at any point? Perhaps it's okay with SQLite, but will my design fail to translate to other database engines? Any thoughts from people more familiar with databases would be helpful. I am writing this so I don't have to deal with databases!
The documentation says:
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined.
So you cannot simply use OFFSET to identify rows.
A PRIMARY KEY constraint just tells the database that it must enforce UNIQUE and NOT NULL constraints on the PK columns. If you do not declare a PRIMARY KEY, these constraints are not automatically enforced, but this does not change the fact that you have to identify your rows somehow when you want to access them.
The easiest way to store list entries is to have the position in the list as a separate column. (If your program spends most of its time inserting or deleting list entries, it might be better to store the list not as an array but as a linked list, i.e., the database stores not the position but a pointer to the next entry.)
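A minimal sketch of the position-column approach in SQLite (the table and column names are arbitrary, not from your repo):

    import sqlite3

    conn = sqlite3.connect("list.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (position INTEGER PRIMARY KEY, value TEXT)"
    )

    def set_item(position, value):
        # Parameterized statements avoid the string-formatting pitfalls
        # of building SQL by hand.
        conn.execute(
            "INSERT OR REPLACE INTO items (position, value) VALUES (?, ?)",
            (position, value),
        )
        conn.commit()

    def get_item(position):
        # Look up by the explicit position column instead of relying on
        # the undefined row order of a bare SELECT with OFFSET.
        row = conn.execute(
            "SELECT value FROM items WHERE position = ?", (position,)
        ).fetchone()
        return row[0] if row else None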
Suppose I have the following GQL database,
class Signatories(db.Model):
    name = db.StringProperty()
    event = db.StringProperty()
This database holds information regarding events that people have signed up for. Say I have the following entries in the database, in the format (name, event): (Bob, TestEvent), (Bob, TestEvent2), (Fred, TestEvent), (John, TestEvent).
But the dilemma here is that I cannot just aggregate all of Bob's events into one entity, because I'd like to query for all the people signed up for a specific event, and I'd also like to add such entries without having to manually update an existing entity every single time.
How could I count the number of distinct strings given by a GQL Query in Python (in my example, I am specifically trying to see how many people are currently signed up for events)?
I have tried using mcount = db.GqlQuery("SELECT name FROM Signatories").count(), however this of course returns the total number of names, regardless of the uniqueness of each one.
I have also tried using count = len(member), where member = db.GqlQuery("SELECT name FROM Signatories"), but unfortunately, this only returns an error.
You can't - at least not directly. (By the way, you don't have a "GQL database"; GQL is just the query language for the App Engine Datastore.)
If you have a small number of items, then fetch them into memory, use a set operation to produce the unique values, and count them.
If you have larger numbers of entities that make in-memory filtering and counting problematic, then your strategy should be to aggregate the count as you create them,
e.g.
create a separate entity each time you record a signup, using the pair of strings as the key. This way you will only ever have one entity in the datastore representing that specific pair, and then you can do a straight count.
However, as you get large numbers of these entities you will need to do some additional work to count them, as a single query.count() will become too expensive. You will then need to start looking at counting strategies using the datastore.
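A rough sketch of both options with the old db API (the dedup kind name and key format are assumptions for illustration):

    from google.appengine.ext import db

    # Option 1: small data set -- pull the entities into memory and count
    # the distinct names with a set.
    names = set(s.name for s in db.GqlQuery("SELECT * FROM Signatories"))
    distinct_count = len(names)

    # Option 2 (sketch): write a dedup entity per (name, event) pair, keyed
    # on the pair itself, so the datastore holds exactly one entity per pair
    # and a plain count() gives the number of distinct pairs.
    class SignupPair(db.Model):
        pass

    def record_signup(name, event):
        SignupPair.get_or_insert(key_name="%s|%s" % (name, event))

    pair_count = SignupPair.all(keys_only=True).count()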