How to implement SQL databases in a discord.py bot - Python

I'm new to Python and am trying to create a bot. I want an optimized way to modify and access my bot's configs per server, and I had two ideas on how/when to fetch the configs from the database.
The first is what you would normally do: just fetch the data you need (one variable at a time) for each command. This keeps the bot simple and minimizes unused resources.
In the second, whenever a user runs a command for the first time, the bot fetches the entire config table and stores it in a dict, from which the config is then read. Config updates are also made in the dict, and every 30 minutes to an hour the values are written back to the table and the dict is emptied. The benefit is fewer SQL calls, but potentially less scalability because of unused objects sitting in the dict.
Can someone help me decide which one is better? I don't know how Discord bots are normally built or what the convention is.

Your second approach is called caching. You're basically creating a cache inside your application (the dictionary) and saving the data you usually need so you can access it quickly. It's what almost every major service (Steam, for example) does to minimize calls to the main database.
I think this is the better practice, but it has its drawbacks.
First, from time to time you have to reconcile the cached data with what's in the original database, because your bot won't have just a single user: while the cached data is being served to one user, another user might alter the data in the original database.
Second, it is harder to implement than the first approach. You need to decide which data to cache and which data to update frequently, and you also need some kind of invalidation mechanism so the cache is refreshed whenever the underlying data is altered in the database.
If I were you and I just wanted to mess around with bots, I would go with just fetching the data each time from the database. It's easier and it is good enough for most applications.
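For reference, here is a rough sketch of what the two ideas might look like in code, assuming an SQLite database accessed with aiosqlite and a guild_config table keyed by guild_id; all of the table, column and command names are made up, the table is assumed to exist, and the flush loop would still need to be started (e.g. in setup_hook):

```python
import aiosqlite
import discord
from discord.ext import commands, tasks

bot = commands.Bot(command_prefix="!", intents=discord.Intents.default())
DB_PATH = "bot.db"

# Idea 1: fetch just the value you need, every time the command runs.
@bot.command()
async def prefix(ctx):
    async with aiosqlite.connect(DB_PATH) as db:
        cur = await db.execute(
            "SELECT prefix FROM guild_config WHERE guild_id = ?", (ctx.guild.id,)
        )
        row = await cur.fetchone()
    await ctx.send(f"Current prefix: {row[0] if row else '!'}")

# Idea 2: load a guild's config into a dict on first use, update it there,
# and periodically write everything back and empty the dict.
config_cache: dict[int, dict] = {}

async def get_config(guild_id: int) -> dict:
    if guild_id not in config_cache:
        async with aiosqlite.connect(DB_PATH) as db:
            cur = await db.execute(
                "SELECT prefix, welcome_channel FROM guild_config WHERE guild_id = ?",
                (guild_id,),
            )
            row = await cur.fetchone()
        config_cache[guild_id] = {
            "prefix": row[0] if row else "!",
            "welcome_channel": row[1] if row else None,
        }
    return config_cache[guild_id]

@tasks.loop(minutes=30)
async def flush_cache():
    # Persist the cached values, then empty the dict, as described in the question.
    async with aiosqlite.connect(DB_PATH) as db:
        for guild_id, cfg in config_cache.items():
            await db.execute(
                "INSERT OR REPLACE INTO guild_config (guild_id, prefix, welcome_channel) "
                "VALUES (?, ?, ?)",
                (guild_id, cfg["prefix"], cfg["welcome_channel"]),
            )
        await db.commit()
    config_cache.clear()
```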

Related

How to scale Python Heroku app that needs to perform an action for every entry in a database

I'm not able to articulate this well, but I will try. I have a Python script that is acting as a middle man between two APIs. It pulls info from API 1, uses that info to get more information from API 2, then sends a request back to API 1 with the updated information. It only performs this update routine when a specific field is updated by the user in the integration. An ID is stored for each integration.
Currently, I am testing this with one ID (my test integration). I have it running in an infinite loop so that it is constantly checking for updates.
I plan on making this public so that anyone can integrate with this. The IDs will be stored in a Postgres database on Heroku. The problem is the architecture of my program: the more people that integrate with this, the more IDs the program needs to cycle through and check in a linear fashion.
What would be the ideal way to go through every ID and check for updates every few seconds without losing speed as the number of integrations and IDs increases?
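One common way to keep that check from slowing down linearly as IDs are added is to run the per-ID checks concurrently, since each one mostly waits on HTTP calls. Here is a rough sketch with a thread pool; fetch_ids and check_and_update are hypothetical placeholders for the Postgres query and the API 1 -> API 2 -> API 1 round trip:

```python
import time
from concurrent.futures import ThreadPoolExecutor

POLL_INTERVAL = 5  # seconds to wait between full passes over the IDs


def fetch_ids():
    """Hypothetical: SELECT the integration IDs from the Heroku Postgres table."""
    return []


def check_and_update(integration_id):
    """Hypothetical: pull from API 1, enrich via API 2, push the result back to API 1."""


def main():
    with ThreadPoolExecutor(max_workers=20) as pool:
        while True:
            ids = fetch_ids()
            # map() blocks until every ID in this pass has been checked,
            # but up to 20 checks run at the same time instead of one by one.
            list(pool.map(check_and_update, ids))
            time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    main()
```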

Saving frequently updated arrays to database

I am creating a machine learning app which should save a number to a database locally and frequently.
These values are connected, which basically means I want to frequently update a time series by appending a number to the list.
An ideal case would be being able to save key-value pairs, where the key is the name of the array (for example train_loss) and the value is the corresponding time series.
My first idea was to leverage Redis, but as far as I know Redis data is only kept in RAM? What I want to achieve is saving to disk after every log, or perhaps after every couple of logs.
I need the local save of data since this data will be consumed by other app (in javascript). Therefore some JSON-like format would be nice.
Using JSON files (and Python json package) is an option, but I believe it would result in an I/O bottleneck because of frequent updates.
I am basically trying to create a clone of a web app like Tensorboard.
A technique we use in the backend of a hosted application, for frequently used read/write APIs, is to write to Redis and to the DB at the same time. During a read we check whether the key is available in Redis; if it's not, we read it from the DB, update Redis, and then serve it.
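Here is a rough sketch of that write-through pattern, using redis-py with a local SQLite table standing in for the database; the key and table names are made up, and Redis can also be configured to persist to disk (RDB snapshots or AOF), which addresses the "only in RAM" concern:

```python
import json
import sqlite3
import redis

r = redis.Redis(decode_responses=True)
db = sqlite3.connect("metrics.db")
db.execute("CREATE TABLE IF NOT EXISTS series (name TEXT PRIMARY KEY, values_json TEXT)")


def append_value(name: str, value: float) -> None:
    # Write to Redis and to the database at the same time.
    r.rpush(name, value)
    series = [float(v) for v in r.lrange(name, 0, -1)]
    db.execute(
        "INSERT OR REPLACE INTO series (name, values_json) VALUES (?, ?)",
        (name, json.dumps(series)),
    )
    db.commit()


def read_series(name: str) -> list:
    # Serve from Redis if the key is there; otherwise read from the database,
    # warm the cache, and then serve it.
    cached = r.lrange(name, 0, -1)
    if cached:
        return [float(v) for v in cached]
    row = db.execute("SELECT values_json FROM series WHERE name = ?", (name,)).fetchone()
    if row is None:
        return []
    series = json.loads(row[0])
    if series:
        r.rpush(name, *series)
    return series


# Usage: append_value("train_loss", 0.42); read_series("train_loss")
```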

Etags used in RESTful APIs are still susceptible to race conditions

Maybe I'm overlooking something simple and obvious here, but here goes:
So one of the features of the Etag header in an HTTP request/response is to enforce concurrency control, namely so that multiple clients cannot overwrite each other's edits of a resource (normally when doing a PUT request). I think that part is fairly well known.
The bit I'm not so sure about is how the backend/API implementation can actually implement this without having a race condition; for example:
Setup:
A RESTful API sits on top of a standard relational database, using an ORM for all interactions (SQLAlchemy on top of Postgres, for example).
The Etag is based on the 'last updated time' of the resource.
The web framework (Flask) sits behind a multi-threaded/multi-process web server (nginx + gunicorn), so it can process multiple requests concurrently.
The problem:
Clients 1 and 2 both request a resource (GET request); both now have the same Etag.
Clients 1 and 2 both send a PUT request to update the resource at the same time. The API receives the requests, uses the ORM to fetch the required information from the database, then compares each request's Etag with the 'last updated time' from the database... they match, so each is a valid request. Each request continues on and commits its update to the database.
Each commit is a synchronous/blocking transaction, so one request will get in before the other and thus one will overwrite the other's changes.
Doesn't this break the purpose of the Etag?
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example. Am I missing something?
P.S. Tagged as Python due to the frameworks used, but this should be a language/framework-agnostic problem.
This is really a question about how to use ORMs to do updates, not about ETags.
Imagine 2 processes transferring money into a bank account at the same time -- they both read the old balance, add some, then write the new balance. One of the transfers is lost.
When you're writing with a relational DB, the solution to these problems is to put the read + write in the same transaction, and then use SELECT FOR UPDATE to read the data and/or ensure you have an appropriate isolation level set.
The various ORM implementations all support transactions, so getting the read, check and write into the same transaction will be easy. If you set the SERIALIZABLE isolation level, then that will be enough to fix race conditions, but you may have to deal with deadlocks.
ORMs also generally support SELECT FOR UPDATE in some way. This will let you write safe code with the default READ COMMITTED isolation level. If you google SELECT FOR UPDATE and your ORM, it will probably tell you how to do it.
In both cases (serializable isolation level or select for update), the database will fix the problem by getting a lock on the row for the entity when you read it. If another request comes in and tries to read the entity before your transaction commits, it will be forced to wait.
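For example, here is a rough sketch of the SELECT FOR UPDATE variant with SQLAlchemy (the ORM named in the question); the Resource model, make_etag helper and exception are illustrative placeholders:

```python
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Resource(Base):  # hypothetical model, for illustration only
    __tablename__ = "resources"
    id = Column(Integer, primary_key=True)
    data = Column(String)
    updated_at = Column(DateTime)


class PreconditionFailed(Exception):
    """Raised so the view layer can return HTTP 412 Precondition Failed."""


def make_etag(updated_at: datetime) -> str:
    # Weak ETag derived from the timestamp, matching the setup in the question.
    return f'W/"{updated_at.isoformat()}"'


def update_resource(session: Session, resource_id: int, new_data: str, request_etag: str):
    with session.begin():
        # SELECT ... FOR UPDATE locks the row until this transaction commits,
        # so a concurrent PUT blocks here instead of racing past the check.
        resource = (
            session.query(Resource)
            .filter_by(id=resource_id)
            .with_for_update()
            .one()
        )
        if make_etag(resource.updated_at) != request_etag:
            raise PreconditionFailed()
        resource.data = new_data
        resource.updated_at = datetime.now(timezone.utc)
    # The check and the write happen inside one transaction, so the losing
    # client's request sees the new updated_at and fails the ETag comparison.
```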
Etag can be implemented in many ways, not just last updated time. If you choose to implement the Etag purely based on last updated time, then why not just use the Last-Modified header?
If you were to encode more information into the Etag about the underlying resource, you wouldn't be susceptible to the race condition that you've outlined above.
The only fool proof solution I can think of is to also make the database perform the check, in the update query for example. Am I missing something?
That's your answer.
Another option would be to add a version column to each of your resources which is incremented on each successful update. When updating a resource, specify both the ID and the version in the WHERE clause, and additionally set version = version + 1. If the resource has been updated since the last request, the update fails because no matching record is found. This eliminates the need for locking.
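A rough sketch of that versioned update, here with SQLAlchemy Core and made-up table and column names:

```python
from sqlalchemy import create_engine, text

# Placeholder connection URL for illustration only.
engine = create_engine("postgresql://user:pass@localhost/mydb")


def update_with_version(resource_id, new_data, expected_version) -> bool:
    with engine.begin() as conn:
        result = conn.execute(
            text(
                "UPDATE resources "
                "SET data = :data, version = version + 1 "
                "WHERE id = :id AND version = :version"
            ),
            {"data": new_data, "id": resource_id, "version": expected_version},
        )
    # rowcount == 0 means someone else updated the row first: respond with
    # 412 Precondition Failed instead of silently overwriting their change.
    return result.rowcount == 1
```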
You are right that you can still get race conditions if the 'check last etag' and 'make the change' aren't in one atomic operation.
In essence, if your server itself has a race condition, sending etags to the client won't help with that.
You already mentioned a good way to achieve this atomicity:
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example.
You could do something else, like using a mutex lock, or an architecture where two threads cannot touch the same data.
But the database check seems good to me. The ORM-level check you describe might be an addition for better error messages, but it is not sufficient by itself, as you found.

Writing a Django backend program that runs indefinitely -- what to keep in mind?

I am trying to write a Django app that queries a remote database for some data, performs some calculations on a portion of this data and stores the results (in the local database using Django models). It also filters another portion and stores the result separately. My front end then queries my Django database for these processed data and displays them to the user.
My questions are:
How do I write an agent program that runs continuously in the background, downloads data from the remote database, does the calculations/filtering and stores the results in the local Django database? In particular, what are the most important things to keep in mind when writing a program that runs indefinitely?
Is using cron for this purpose a good idea?
The data retrieved from the remote database belongs to multiple users, and each user's data must be kept/stored separately in my local database as well. How do I achieve that? Using row-level or instance-level permissions, maybe? Remember that the backend agent does the storing, updating and deleting; the front end only reads data (through HTTP requests).
And finally, I allow creation of new users. If a new user has valid credentials for the remote database, they should be allowed to use my app, in which case my backend will download that user's data from the remote database, perform the calculations/filtering and present the results to them. How can I handle the dynamic creation of objects/database tables for new users? And how can I differentiate between users' data when retrieving it?
Would very much appreciate answers from experienced programmers with knowledge of Django. Thank you.
1) The standard go-to solution for timed and background tasks is Celery, which has Django integration. There are others, like Huey (https://github.com/coleifer/huey).
2) The usual solution is that each row contains a user_id column indicating which user the data belongs to. This maps to the User model via Django ORM's ForeignKey field; see the sketch after this list. Do your users need to query the database directly, or do they have their own database accounts? If not, this solution should be enough. It sounds like your front end has one database connection and all permission logic is handled by the front end, not the database itself.
3) See 2.
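A minimal sketch of points 1 and 2, assuming Celery is already wired into the Django project; the model, task and helper names are all made up, and this would live in an app's models.py/tasks.py:

```python
from celery import shared_task
from django.conf import settings
from django.db import models


class ProcessedRecord(models.Model):
    # Point 2: every row is tied to the user it belongs to via a ForeignKey,
    # so the front end can filter with ProcessedRecord.objects.filter(user=...).
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    value = models.FloatField()
    created_at = models.DateTimeField(auto_now_add=True)


@shared_task
def sync_remote_data():
    # Point 1: a periodic Celery task (scheduled with celery beat) replaces a
    # hand-rolled "program that runs indefinitely": fetch, calculate, store.
    for user, rows in fetch_remote_rows_per_user():
        for row in rows:
            ProcessedRecord.objects.create(user=user, value=calculate(row))


def fetch_remote_rows_per_user():
    """Hypothetical: return (user, rows) pairs read from the remote database."""
    return []


def calculate(row):
    """Hypothetical: whatever per-row calculation the app needs."""
    return float(row)
```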

How to interface with another database effectively using python

I have an application that needs to interface with another app's database. I have read access but not write.
Currently I'm using SQL statements via pyodbc to grab the rows and using Python to manipulate the data. Since I don't cache anything, this can be quite costly.
I'm thinking of using an ORM to solve my problem. The question is: if I use an ORM like SQLAlchemy, would it be smart enough to pick up changes in the other database?
E.g. SQLAlchemy accesses a table and retrieves a row. If that row gets modified outside of SQLAlchemy, would it pick up the change?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit: To be more clear
I have one application that is simply a reporting tool; let's call it App A.
I have another application that handles various financial transactions; call it App B.
App A has access to App B's database to retrieve the transactions and generate various reports. There are hundreds of thousands of transactions. We're currently caching this info manually in Python; if we need an updated report, we refresh the cache. If we got rid of the cache, the SQL queries combined with the calculations would not scale.
I don't think an ORM is the solution to your performance problem. By default, ORMs tend to be less efficient than raw SQL because they may fetch data you're not going to use (e.g. doing a SELECT * when you need only one field), although SQLAlchemy allows fine-grained control over the SQL it generates.
Now to implement a caching mechanism, depending on your application, you could use a simple dictionary in memory or a specialized system such as memcached or Redis.
To keep your cached data relatively fresh, you can poll the source at regular intervals, which might be OK if your application can tolerate a little delay. Otherwise you'll need the application that has write access to the db to notify your application or your cache system when an update occurs.
Edit: since you seem to have control over app B, and you've already got a cache system in app A, the simplest way to solve your problem is probably to create a callback in app A that app B can call to expire cached items. Both apps need to agree on a convention to identify cached items.
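For instance, a minimal sketch of such a callback, assuming app A keeps its cache in Redis and exposes a small Flask endpoint that app B calls right after committing a transaction; the route, port and key convention are all made up and just need to be agreed on by both apps:

```python
from flask import Flask, request
import redis

app = Flask(__name__)
cache = redis.Redis()


@app.route("/cache/expire", methods=["POST"])
def expire_cached_item():
    # App B posts the account it just wrote to; app A drops the cached report
    # data for that account so the next report re-reads from app B's database.
    account_id = request.json["account_id"]
    cache.delete(f"transactions:{account_id}")
    return {"expired": account_id}


# In app B, right after committing a transaction for an account:
#   requests.post("http://app-a:5000/cache/expire", json={"account_id": account_id})
```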
