Storing chat logs in a relational database - Python

I am writing a chat bot that uses past conversations to generate its responses. Currently I use text files to store all the data but I want to use a database instead so that multiple instances of the bot can use it at the same time.
How should I structure this database?
My first idea was to keep a main table like create table Sessions (startTime INT, ip INT, botVersion REAL, length INT, tableName TEXT). Then for each conversation I would create table <generated name> (timestamp INT, message TEXT) with all the messages that were sent or received during that conversation. When the conversation is over, I insert the name of the new table into Sessions(tableName). Is it OK to programmatically create tables in this manner? I am asking because most SQL tutorials seem to suggest that tables are created when the program is initialized.
Another way to do this is to have one huge Messages table, created with create table Messages(id INT, message TEXT), that stores every message that was sent or received. When a conversation is over, I can add a new entry to Sessions that includes the id used during that conversation, so that I can look up all the messages sent during a certain conversation. I guess one advantage of this is that I don't need to have hundreds or thousands of tables.
I am planning on using SQLite despite its low concurrency since each instance of the bot may make thousands of reads before generating a response (which will result in one write). Still, if another relational database is better suited for this task, please comment.
Note: There are other questions on SO about storing chat logs in databases but I am specifically looking for how it should be structured and feedback on the above ideas.

Don't use a different table for each conversation. Instead add a "conversation" column to your single table.
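For illustration, a minimal sqlite3 sketch of that single-table layout (the column names and types are just examples, not a prescription):

import sqlite3

conn = sqlite3.connect("chatbot.db")

# One row per conversation; every message references it via session_id.
conn.executescript("""
CREATE TABLE IF NOT EXISTS Sessions (
    id         INTEGER PRIMARY KEY,
    startTime  INTEGER,
    ip         TEXT,
    botVersion REAL,
    length     INTEGER
);
CREATE TABLE IF NOT EXISTS Messages (
    id         INTEGER PRIMARY KEY,
    session_id INTEGER REFERENCES Sessions(id),
    timestamp  INTEGER,
    message    TEXT
);
CREATE INDEX IF NOT EXISTS idx_messages_session ON Messages(session_id);
""")

# Logging one message for a hypothetical session 42.
conn.execute("INSERT INTO Messages (session_id, timestamp, message) VALUES (?, ?, ?)",
             (42, 1700000000, "hello"))
conn.commit()

Looking up an entire conversation is then a single indexed query on session_id instead of a lookup by generated table name.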

Related

Detect New Entries in Database Using Python

I'm currently trying to make a Twitter bot that will print out new entries from my database. I'm trying to use Python to do this, and can successfully post messages to Twitter. However, whenever a new entry comes into the database, the bot doesn't pick it up.
How would I go about implementing something like this, and what would I use? I'm not too experienced with this topic area. Any help or guidance would be appreciated.
Use a trigger to propagate newly inserted rows from the original table to a record table that Python watches. Python can then post each new record and, once posted, remove it from the record table.
DELIMITER //
DROP TRIGGER IF EXISTS record_after_insert //
CREATE TRIGGER record_after_insert AFTER INSERT ON original_table
FOR EACH ROW BEGIN
    -- Copy the just-inserted message into the table Python polls.
    INSERT INTO record_table (new_record) VALUES (NEW.new_message);
END //
DELIMITER ;
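On the Python side, a minimal polling sketch; it assumes mysql.connector and that record_table also has an auto-increment id column (both are assumptions), and it just prints each record where you would call your Twitter client:

import time
import mysql.connector

conn = mysql.connector.connect(user="bot", password="secret",
                               host="localhost", database="chatdb")

while True:
    cur = conn.cursor()
    # Grab everything the trigger has queued since the last pass.
    cur.execute("SELECT id, new_record FROM record_table ORDER BY id")
    for row_id, message in cur.fetchall():
        print(message)                                   # replace with your Twitter post call
        cur.execute("DELETE FROM record_table WHERE id = %s", (row_id,))
    conn.commit()
    cur.close()
    time.sleep(30)                                       # poll every 30 seconds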

Update SQL database registers based on JSON

I have a table with 30k clients, with the ClientID as primary key.
I'm getting data from API calls and inserting them into the table using python.
I'd like to find a way to insert rows for new clients and, if the ClientID that comes with the API call already exists in the table, update the existing record with the updated information for that client.
Thanks!!
A snippet of code would be nice to show us what exactly you are doing right now. I presume you are using an ORM like SQLAlchemy? If so, then you are looking at doing an UPSERT type of operation.
That is already answered HERE
Alternatively, if you are executing raw queries without an ORM, then you could write a custom procedure and pass the required parameters. HERE is a good write-up on how that is done in MSSQL under high concurrency. You could use this as a starting point for understanding and then rewrite it for PostgreSQL.
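For what it's worth, here is a rough SQLAlchemy Core sketch of a PostgreSQL upsert; the clients table, its columns, and api_payload are hypothetical stand-ins for your actual schema and API data:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String
from sqlalchemy.dialects.postgresql import insert

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
metadata = MetaData()

# Hypothetical clients table keyed on client_id.
clients = Table("clients", metadata,
                Column("client_id", Integer, primary_key=True),
                Column("name", String),
                Column("email", String))

def upsert_client(conn, row):
    stmt = insert(clients).values(**row)
    # On a primary-key conflict, update the non-key columns instead of failing.
    stmt = stmt.on_conflict_do_update(
        index_elements=["client_id"],
        set_={"name": stmt.excluded.name, "email": stmt.excluded.email},
    )
    conn.execute(stmt)

with engine.begin() as conn:
    for row in api_payload:        # api_payload: list of dicts parsed from the API response
        upsert_client(conn, row)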

dynamodb update denormalized data and keep consistency

I am using DynamoDB with the Python API and denormalize my data in order to keep reads fast. The thing is that I am worried about keeping consistency when updating my data. Say I have a table of users, each with a key and a name, and a table of purchases, each with a key and data containing the buyer's key (user) and the buyer's name.
I would like to update a user's name and update all of his purchases in one atomic operation, like the multi-path update available in Firebase, explained here.
How can I do that?
Thanks
Here is a nice piece of documentation on DynamoDB transactions.
Here are a few highlights of the blog post:
DynamoDB supports transactions across multiple tables, where you can also have a pre-condition on every insert (e.g. insert into the order table only if prev_snapshot=1223232; this makes sure you are modifying only the last data you read).
Two kinds of reads are supported: TransactGetItems and eventually/strongly consistent GetItem. With TransactGetItems, the request is rejected if a transaction is in progress; in the other two cases the last committed data is returned, based on your consistency requirements.
Transactions are not locks. If some other thread writes to a table without a transaction and that write succeeds before the transaction completes, an exception will be thrown on the transaction.
No extra steps or permissions are required to enable transactions on a single-region table.
Cost doubles for every read and write while using transactional capabilities.
Here are the features which are not supported:
Transactional capabilities across global tables, but this can be avoided with request stickiness and should not be a big issue IMO.
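For illustration, a minimal boto3 sketch of a transactional write that renames a user and fixes up the denormalized copy on one purchase in a single call; the table, key and attribute names are assumptions based on the question, and a single transaction only holds a limited number of items, so many purchases would have to be chunked:

import boto3

client = boto3.client("dynamodb", region_name="us-east-1")

# Update the user's name and the buyer_name copy on a purchase atomically.
client.transact_write_items(
    TransactItems=[
        {"Update": {
            "TableName": "users",
            "Key": {"user_id": {"S": "u-123"}},
            "UpdateExpression": "SET #n = :name",
            "ExpressionAttributeNames": {"#n": "name"},
            "ExpressionAttributeValues": {":name": {"S": "New Name"}},
        }},
        {"Update": {
            "TableName": "purchases",
            "Key": {"purchase_id": {"S": "p-456"}},
            "UpdateExpression": "SET buyer_name = :name",
            "ExpressionAttributeValues": {":name": {"S": "New Name"}},
        }},
    ]
)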

AWS DynamoDB retrieve entire table

Folks,
Retrieving all items from a DynamoDB table, I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
# assuming boto's dynamodb2 interface; 'url' holds the config with the table name
from boto.dynamodb2.table import Table

drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()   # scan pulls back every item in the table
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to query and get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of DynamoDB, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan; a paginated sketch follows below.
Export your data using AWS Data Pipeline. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms to capture new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process the events using AWS Lambda.
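As mentioned above, a paginated scan is the usual way to read everything; here is a rough boto3 sketch that collects the 'number' attribute from the question (the table name is a placeholder):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("drivers")              # table name is a placeholder

def scan_page(start_key=None):
    kwargs = {"ProjectionExpression": "#n",
              "ExpressionAttributeNames": {"#n": "number"}}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    return table.scan(**kwargs)

all_numbers = []
response = scan_page()
all_numbers.extend(item["number"] for item in response["Items"])
# Each scan call returns at most 1 MB; keep paging until LastEvaluatedKey disappears.
while "LastEvaluatedKey" in response:
    response = scan_page(response["LastEvaluatedKey"])
    all_numbers.extend(item["number"] for item in response["Items"])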
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the table's composite hash-range primary key).
One cannot use Query without knowing the hash keys.
EDIT: a bounty was added to this old question, asking:
How do I get a list of hashes from DynamoDB?
Well - in Dec 2014 you still can't ask, via a single API call, for all hash keys of a table.
Even if you add a GSI, you still can't get a DISTINCT hash count.
The way I would solve this is with de-normalization: keep another table with no range key and put every hash key there, alongside the main table. This adds housekeeping overhead at the application level (mainly when removing), but solves the problem you asked about.
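A rough sketch of that de-normalization with boto3, using hypothetical table and key names; the point is just the dual write plus the matching housekeeping on delete:

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
main_table = dynamodb.Table("events")        # hash key: device_id, range key: ts
keys_table = dynamodb.Table("event_keys")    # hash key only: device_id

def put_event(device_id, ts, payload):
    # Write the item, then record its hash key in the keys-only table.
    main_table.put_item(Item={"device_id": device_id, "ts": ts, "payload": payload})
    keys_table.put_item(Item={"device_id": device_id})

def delete_device(device_id, ts_values):
    # Housekeeping: remove every range-keyed item, then the hash-key entry.
    for ts in ts_values:
        main_table.delete_item(Key={"device_id": device_id, "ts": ts})
    keys_table.delete_item(Key={"device_id": device_id})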

What are some ways to maintain data consistency at the application layer of NoSQL?

My Python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL tables where index consistency is done at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
* Table 1: (user_id) email, employee_id, first name, last name, etc ...
* Table 2: (email) user_id
* Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data: what is the best way to handle deletions from Table 1 so that the matching data gets deleted from Tables 2 and 3? And how do I ensure the same for inserts?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Something like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
Short answer is: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL, in exchange for performance and scalability.
dynamodb-mapper has a "transaction engine". Transaction objects are plain DynamoDB items and may be persisted. This way, if a logical group of actions (a transaction) has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure it has not...
To do a bit of advertisement :), the dynamodb-mapper transaction engine supports:
single/multiple targets
sub-transactions
transactions that create objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main developers of the dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford lazy deletion and insertion, as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the (now stale) index, but it won't show up in the main table. You know that something is wrong, so you can return a failure/error and clean up the index. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary; a rough sketch follows below. You'll have to figure out how much stale data to expect, and maybe write a script that does some daily housekeeping.
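A rough boto3 sketch of that clean-on-read behaviour; the table and attribute names follow the question but are otherwise assumptions:

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
users = dynamodb.Table("users")                    # Table 1: keyed on user_id
email_index = dynamodb.Table("users_by_email")     # Table 2: keyed on email

def get_user_by_email(email):
    # Resolve the user_id through the hand-rolled index first.
    idx = email_index.get_item(Key={"email": email}).get("Item")
    if idx is None:
        return None
    user = users.get_item(Key={"user_id": idx["user_id"]}).get("Item")
    if user is None:
        # Stale index entry: the primary row is gone, so clean it up lazily.
        email_index.delete_item(Key={"email": email})
        return None
    return user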
B. If you really want to simulate locks and transactions, you could consider using something like Apache ZooKeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.
