I want to create a SQLite3 database using Python and want to know the best method of managing it. By managing, I mean: how do I ensure there is a backup if/when some kind of corruption occurs? Is there any way to recover the data afterwards, or is it better to keep a backup of the whole database? What is best practice?
TIA!
Based on comments, the database is currently planned to be SQLite3. That narrows down the scope considerably. As far as the operating system is concerned (processes, data file access, etc.), the database is simply a file created and updated directly by the Python program. That means backup/restore is normally done by a full copy of the database file(s), either to another location on the same machine (which handles corruption due to program problems, but not hardware problems, OS crashes, etc.) or over a network to another local machine or a remote/cloud server of some sort.
If your application needs to run continuously, then you may want to manage backup within the application. If it runs periodically (specific times or via some external trigger) then you can have the backups run on a regular basis outside the application (which, for SQLite3, simply means a script to copy the appropriate file(s)).
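If you go the in-application route, here is a minimal sketch using the online backup API that the sqlite3 module exposes in Python 3.7+; the file names are illustrative, and the copy is safe to run while the application still holds the database open:

    import sqlite3

    def backup_db(src_path="app.db", dest_path="app-backup.db"):
        # The online backup API copies a consistent snapshot even if the
        # source connection is in use elsewhere in the application.
        src = sqlite3.connect(src_path)
        dest = sqlite3.connect(dest_path)
        with dest:
            src.backup(dest)
        dest.close()
        src.close()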
I want to create a Python3 program that takes in MySQL data and holds it temporarily, and can then pass this data onto a cloud MySQL database.
The idea would be that it acts as a buffer for entries in the event that my local network goes down; the buffer would then be able to pass those entries on at a later date, theoretically providing fault tolerance.
I have done some research into replication and GTIDs and I'm currently in the process of learning these concepts. However, I would like to write my own solution, or at least have it be a smaller program rather than a full server-side implementation of replication.
I already have a program that generates some MySQL data to fill my DB, the key part I need help with would be the buffer aspect/implementation (The code itself I have isn't important as I can rework it later on).
I would greatly appreciate any good resources or help, thank you!
I would implement what you describe using a message queue.
Example: https://hevodata.com/learn/python-message-queue/
The idea is to run a message queue service on your local computer. Your Python application pushes items into the MQ instead of committing directly to the database.
Then you need another background task, called a worker, which you may also write in Python or another language, which consumes items from the MQ and writes them to the cloud database when it's available. If the cloud database is not available, then the background worker pauses.
The data in the MQ can grow while the background worker is paused. If this goes on too long, you may run out of space. But hopefully the rate of growth is slow enough and the cloud database is available regularly, so the risk of this happening is low.
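To make the shape concrete, here is a minimal sketch, assuming Redis is used as the local message queue (via redis-py) and the cloud database is MySQL reached through mysql-connector-python; the queue name, table, and credentials are illustrative:

    import json
    import time

    import mysql.connector
    import redis

    r = redis.Redis()  # local MQ running on the same machine as the app

    def push_entry(entry):
        # The application calls this instead of writing to the cloud DB directly.
        r.lpush("pending_entries", json.dumps(entry))

    def worker():
        # Background worker: pull one item at a time and retry until the
        # cloud database accepts it.
        while True:
            item = r.brpop("pending_entries", timeout=5)
            if item is None:
                continue
            _, payload = item
            entry = json.loads(payload)
            while True:
                try:
                    conn = mysql.connector.connect(
                        host="cloud-host", user="user",
                        password="secret", database="mydb")
                    cur = conn.cursor()
                    cur.execute("INSERT INTO readings (value) VALUES (%s)",
                                (entry["value"],))
                    conn.commit()
                    conn.close()
                    break
                except mysql.connector.Error:
                    time.sleep(30)  # cloud DB unreachable; pause, then retry

Note that an item popped from the queue but not yet written is lost if the worker itself crashes mid-write; a more careful version would move it to an "in-flight" list first.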
Re your comment about performance.
This is a different application architecture, so there are pros and cons.
On the one hand, if your application is "writing" to a local MQ instead of the remote database, it's likely to appear to the app as if writes have lower latency.
On the other hand, posting to the MQ does not write to the database immediately. There still needs to be a step of the worker pulling an item and initiating its own write to the database. So from the application's point of view, there is a brief delay before the data appears in the database, even when the database seems available.
So the app can't depend on the data being ready to be queried immediately after the app pushes it to the MQ. That is, it might be pretty prompt, under 1 second, but that's not the same as writing to the database directly, which ensures that the data is ready to be queried immediately after the write.
The performance of the worker writing the item to the database should be identical to that of the app writing that same item to the same database. From the database perspective, nothing has changed.
I have code that contains a variable I want to change manually, whenever I want, without stopping or pausing the main loop (with input()). I can't find a library that lets me set it manually at runtime, or access the process's memory to change that value.
For now I have set up a file watcher that re-reads the parameters every minute, but I presume this is an inefficient way to do it.
I guess you just want to expose an API. You did it with files, which works but is less common. You can use common best practices such as:
An HTTP web server. You can do it quickly with Flask/Bottle.
gRPC
pub/sub mechanism - Redis, Kafka (more complicated, requires another storage solution - the DB itself).
I guess there are tons of other solutions, but you get the idea. I hope that's what you're looking for.
With those solutions you can manually trigger your endpoint and change whatever you want in your application.
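As a minimal sketch of the first option (assuming Flask; the parameter name and port are illustrative), the main loop reads a shared dict that an HTTP endpoint updates:

    import threading
    import time

    from flask import Flask, request

    app = Flask(__name__)
    params = {"threshold": 1.0}  # the value you want to change at runtime

    @app.route("/set/<name>", methods=["POST"])
    def set_param(name):
        # e.g. curl -X POST "http://localhost:5000/set/threshold?value=2.5"
        params[name] = float(request.args["value"])
        return {"ok": True, name: params[name]}

    def main_loop():
        while True:
            # The long-running loop reads the current value on each iteration.
            print("threshold is", params["threshold"])
            time.sleep(5)

    if __name__ == "__main__":
        threading.Thread(target=main_loop, daemon=True).start()
        app.run(port=5000)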
Hello, I don't think this is the right place for this question, but I don't know where else to ask it. I want to make a website and an API for that website using the same SQLAlchemy database. Would just running them at the same time, independently, be safe, or would this cause corruption from two writes happening at the same time?
SQLAlchemy (SQLA) is a Python wrapper around SQL. It is not its own database. If you're running your website (perhaps Flask?) and managing your API from the same script, you can simply use the same reference to your SQLA instance. When you use SQLA to connect to a database and save the result to a variable, what really happens is that the connection is saved to that variable, and you keep referencing it, as opposed to the more inefficient approach of creating a new connection every time. So when you say
using the same SQLAlchemy database
I believe you are actually referring to the actual underlying database itself, not the SQLA wrapper/connection to it.
If your website and API are not running in the same script (or even if they are, depending on how your API handles simultaneous requests), you may encounter a race condition, which, according to Wikipedia, is defined as:
the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This may be what you are referring to when you mentioned
would this cause corruption from two writes happening at the same time?
To avoid such situations, when a process accesses a file, (depending on the OS) a check is performed to see if there is a "lock" on that file, and if so, the OS refuses to open it. A lock is created when a process accesses a file (and no other process holds a lock on that file), such as by using with open(filename):, and is released when the process no longer holds an open reference to the file (such as when Python execution leaves the with open(filename): block). This may be the real issue you might encounter when using two simultaneous connections to a SQLite db.
However, if you are using something like MySQL, where you connect to a SQL server process and NOT a file, there is no direct access to a file, so there will be no lock on the database, and you may run into that nasty race condition in the following made-up scenario:
Stack Overflow queries the reputation of an account to see if it should be banned due to negative reputation.
AT THE EXACT SAME TIME, someone upvotes an answer made by that account, which would put its reputation one point above the ban threshold.
The outcome is now determined by the speed of execution of these 2 tasks.
If the upvoter has, say, a slow computer, and the "upvote" does not get processed by Stack Overflow before the reputation query completes, the account will be banned. However, if there is some lag on Stack Overflow's end, and the upvote processes before the account query finishes, the account will not get banned.
The key concept behind this example is that all of these steps can occur within fractions of a second, and the outcome depends on the speed of execution on both ends.
To address the issue of data corruption, most databases have a system in place that properly orders database reads and writes; however, semantic issues may still arise, such as in the example given above.
Two applications can use the same database as the DB is a separate application that will be accessed by each flask app.
What you are asking can be done and is the methodology used by many large web applications, especially when the API is written in a different framework than the main application.
Since SQL databases are ACID compliant, they have a system in place to queue the multiple read/write requests made to them and perform them in the correct order while ensuring data reliability.
One question to ask, though, is whether it is useful to write two separate applications. For most Flask-only projects the best approach would be to separate the project using blueprints, having a “main” blueprint and an “api” blueprint.
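A minimal sketch of that blueprint layout, assuming Flask-SQLAlchemy for the database layer (names and the database URI are illustrative):

    from flask import Flask, Blueprint, jsonify
    from flask_sqlalchemy import SQLAlchemy

    db = SQLAlchemy()

    main = Blueprint("main", __name__)
    api = Blueprint("api", __name__, url_prefix="/api")

    @main.route("/")
    def index():
        return "main site"

    @api.route("/items")
    def items():
        # Both blueprints share the same db object, engine, and session.
        return jsonify([])

    def create_app():
        app = Flask(__name__)
        app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///example.db"
        db.init_app(app)
        app.register_blueprint(main)
        app.register_blueprint(api)
        return app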
I have several scripts. Each of them does some computation and it is completely independent from the others. Once these computations are done, they will be saved to disk and a record updated.
The record is maintained by an instance of a class, which saves itself to disk. I would like to have a single record instance used in multiple scripts (for example, record_manager = RecordManager(file_on_disk), and then record_manager.update(...)); but I can't do this right now, because when updating the record there may be concurrent write accesses to the same file on disk, leading to data loss. So I have a separate record manager for every script, and then I merge the records manually later.
What is the easiest way to have a single instance used in all the scripts that solves the concurrent write access problem?
I am using macOS (High sierra) and linux (Ubuntu 16.04).
Thanks!
To build a custom solution to this you will probably need to write a short new queuing module. This queuing module will have write access to the file(s) alone and be passed write actions from the existing modules in your code.
The queuing logic itself should be a pretty straightforward queue architecture.
There may also be existing Python libraries that handle this problem and would save you from writing your own queue class.
Finally, it is possible that this whole thing will be/could be handled in some way by your OS, independent of python.
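As one concrete version of that last point (letting the OS serialize access rather than writing your own queue), here is a minimal sketch using fcntl file locks, which are available on both Linux and macOS; the RecordManager class, its JSON format, and the update() signature are placeholders for the asker's own code:

    import fcntl
    import json

    class RecordManager:
        def __init__(self, path):
            self.path = path

        def update(self, key, value):
            # Take an exclusive OS-level lock on the record file; other
            # scripts calling update() block here until it is released.
            with open(self.path, "a+") as f:
                fcntl.flock(f, fcntl.LOCK_EX)
                f.seek(0)
                try:
                    record = json.load(f)
                except ValueError:
                    record = {}          # empty or brand-new file
                record[key] = value
                f.seek(0)
                f.truncate()
                json.dump(record, f)
                # the lock is released when the file is closed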
I am writing a python service (pyamf) through which a user can access images. All images are stored on a central server. The python services will be running on satellite machines which have network access to server. The service should work as follows:
1. Check locally to see if the file exists; if so, use it.
2. Check locally to see if the file is currently being transferred from the server (file.part exists and its size is changing). If so, wait for the download to finish, then use the file.
3. If the file does not exist and is not being downloaded, download the file via urlretrieve.
The problem is with Apache's multiple threads. Threads are reaching the file presence check at the same time and therefore they all think the file needs to be downloaded. Needless to say, this is not good.
What is the right way to handle this race condition?
Thanks!
I'm guessing it's either a threaded or a forked Apache, but the effect would be the same since they are all accessing a remote resource.
This problem is sometimes called the "dog pile" problem, and it's one of the issues addressed by the Beaker caching library (http://beaker.groovie.org). It provides a system by which you can create a callable that "creates" a new cached value, in this case a URL corresponding to some image that is fetched, if a value doesn't already exist. Locking is used such that concurrent threads or processes wait for the single process elected as the "creator" to finish what it's doing. Beaker will use lock files if configured on a Unix-like, multi-process-oriented system, or mutexes on a Windows system.
I'm the original author of Beaker's guts along with Ben Bangert who packaged it up for usage with Pylons.
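If you would rather not pull in Beaker, the same double-checked locking idea can be written by hand; here is a minimal sketch using a lock file and fcntl (Unix only, assuming Python 3), with the paths and helper name purely illustrative:

    import fcntl
    import os
    from urllib.request import urlretrieve

    def fetch_image(url, local_path):
        if os.path.exists(local_path):
            return local_path
        lock_path = local_path + ".lock"
        with open(lock_path, "w") as lock_file:
            # Only one thread/process at a time gets past this line per image;
            # the others block until the "creator" releases the lock.
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                # Re-check: another worker may have finished while we waited.
                if not os.path.exists(local_path):
                    urlretrieve(url, local_path + ".part")
                    os.rename(local_path + ".part", local_path)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)
        return local_path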