I understand how to save a Redis database using bgsave. However, once my database server restarts, how do I tell whether a saved database is present, and how do I load it into my application? I can tolerate a few minutes of lost data, so I don't need to worry about an AOF, but I cannot tolerate the loss of, say, an hour's worth of data. Doing a bgsave once an hour would therefore work for me; I just don't see how to reload the data back into the database.
If it makes a difference, I am working in Python.
You can stop Redis and replace dump.rdb in /var/lib/redis (or whatever file the dbfilename setting in your redis.conf points to), then start Redis again. Redis loads the RDB file automatically at startup, so there is nothing extra to do on the application side.
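If you want to drive or check this from Python, here is a rough sketch with redis-py (assuming that is the client you use); it triggers a snapshot and shows where and when the last one was saved:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    r.bgsave()                                               # trigger a background snapshot (e.g. from an hourly job)
    print(r.lastsave())                                      # timestamp of the last successful save
    print(r.config_get("dir"), r.config_get("dbfilename"))   # where dump.rdb lives
    print(r.dbsize())                                        # number of keys loaded after a restart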
I want to create a Python 3 program that takes in MySQL data, holds it temporarily, and can then pass this data on to a cloud MySQL database.
The idea is that it acts as a buffer for entries in the event that my local network goes down; the buffer would then be able to pass those entries on at a later date, theoretically providing fault tolerance.
I have done some research into replication and GTIDs and I'm currently in the process of learning these concepts. However, I would like to write my own solution, or at least have it be a smaller program rather than a full server-side implementation of replication.
I already have a program that generates some MySQL data to fill my DB; the key part I need help with is the buffer aspect/implementation (the code I have isn't important, as I can rework it later on).
I would greatly appreciate any good resources or help, thank you!
I would implement what you describe using a message queue.
Example: https://hevodata.com/learn/python-message-queue/
The idea is to run a message queue service on your local computer. Your Python application pushes items into the MQ instead of committing directly to the database.
Then you need another background task, called a worker, which you can also write in Python (or another language). It consumes items from the MQ and writes them to the cloud database when it is available; if the cloud database is not available, the worker pauses.
The data in the MQ can grow while the background worker is paused. If this goes on too long, you may run out of space. But hopefully the rate of growth is slow enough and the cloud database is available regularly, so the risk of this happening is low.
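Here is a rough sketch of both pieces, assuming a local RabbitMQ broker accessed via pika and pymysql for the cloud side; the queue name, table, and connection details are placeholders:

    import json
    import time

    import pika
    import pymysql

    QUEUE = "entries"   # placeholder queue name

    # --- producer side: the app pushes rows here instead of writing to the cloud DB ---
    def buffer_row(channel, row):
        channel.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=json.dumps(row),
            properties=pika.BasicProperties(delivery_mode=2),   # persist the message to disk
        )

    # --- worker side: pull rows and write them to the cloud MySQL database ---
    def handle(ch, method, properties, body):
        row = json.loads(body)
        try:
            conn = pymysql.connect(host="cloud-db.example.com", user="app",
                                   password="secret", database="mydb")
            try:
                with conn.cursor() as cur:
                    cur.execute("INSERT INTO readings (sensor, value) VALUES (%s, %s)",
                                (row["sensor"], row["value"]))
                conn.commit()
            finally:
                conn.close()
            ch.basic_ack(delivery_tag=method.delivery_tag)       # done: remove from queue
        except pymysql.MySQLError:
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
            time.sleep(30)                                       # cloud DB unreachable: pause, message stays queued

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)             # queue survives broker restarts
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue=QUEUE, on_message_callback=handle)
    channel.start_consuming()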
Re your comment about performance.
This is a different application architecture, so there are pros and cons.
On the one hand, if your application is "writing" to a local MQ instead of the remote database, it's likely to appear to the app as if writes have lower latency.
On the other hand, posting to the MQ does not write to the database immediately. There still needs to be a step of the worker pulling an item and initiating its own write to the database. So from the application's point of view, there is a brief delay before the data appears in the database, even when the database seems available.
So the app can't depend on the data being ready to be queried immediately after the app pushes it to the MQ. That is, it might be pretty prompt, under 1 second, but that's not the same as writing to the database directly, which ensures that the data is ready to be queried immediately after the write.
The performance of the worker writing the item to the database should be identical to that of the app writing that same item to the same database. From the database perspective, nothing has changed.
I have an application which was running very quickly. Let's say it took 10 seconds to run. All it does is read a csv, parse it, and store some of the info in sqlalchemy objects which are written to the database. (We never attempt to read the database, only to write).
After adding a many to many relationship to the entity we are building and relating it to an address entity which we now build, the time to process the file has increased by an order of magnitude. We are doing very little additional work: just instantiating an address and storing it in the relationship collection on our entity using append.
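Our real models aren't shown here, but the pattern is roughly this minimal sketch (the entity and column names are just illustrative):

    from sqlalchemy import Column, ForeignKey, Integer, String, Table, create_engine
    from sqlalchemy.orm import Session, declarative_base, relationship

    Base = declarative_base()

    order_addresses = Table(
        "order_addresses", Base.metadata,
        Column("order_id", ForeignKey("orders.id"), primary_key=True),
        Column("address_id", ForeignKey("addresses.id"), primary_key=True),
    )

    class Address(Base):
        __tablename__ = "addresses"
        id = Column(Integer, primary_key=True)
        line1 = Column(String)

    class Order(Base):
        __tablename__ = "orders"
        id = Column(Integer, primary_key=True)
        ref = Column(String)
        addresses = relationship(Address, secondary=order_addresses)

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        for ref, line1 in [("A1", "1 Main St"), ("A2", "2 High St")]:   # stand-in for parsed CSV rows
            order = Order(ref=ref)
            order.addresses.append(Address(line1=line1))                # the append in question
            session.add(order)
        session.commit()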
Most of the time appears to be lost in _load_for_state as can be seen in the attached profiling screenshot:
I'm pretty sure this is unnecessarily lost time, because it looks like it is trying to do some loading even though we never query the database (we always instantiate new objects and save them in this app).
Anyone have an idea how to optimize sqlalchemy here?
update
I tried setting SQLALCHEMY_ECHO = True just to see if it is doing a bunch of database reads, or maybe some extra writes. Bizarrely, it only accesses the database at the same points it did before (following a db.session.commit()). I'm pretty sure all this extra time is not being spent on database access.
My colleague runs a script that pulls data from the DB periodically. He is using the query:
'SELECT url, data FROM table LIMIT {} OFFSET {}'.format(OFFSET, PAGE * OFFSET)
We use Amazon Aurora and he has his own replica (slave) server, but every time he runs it, load touches 98%+.
The table has millions of records.
Would it be better to go with an SQL dump instead of SQL queries for fetching the data?
The options that come to mind are:
SQL dump of selected tables (not sure about the benchmarks)
Federated tables based on a certain reference (date, ID, etc.)
Thanks
I'm making some fairly big assumptions here, but from "without choking it" I'm guessing you mean that when your colleague runs the SELECT to grab the large amount of data, database performance drops for all other operations - presumably your primary application - while the data is being prepared for export.
You mentioned SQL dump, so I'm also assuming that this colleague will be satisfied with data that is roughly correct, i.e. it doesn't have to be up-to-the-instant, transactionally correct data - just good enough for something like analytics work.
If those assumptions are close, your colleague and your database might benefit from
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
This statement should be used carefully, and almost never in a line-of-business application, but it can help people running big queries against the live database, as long as you fully understand the implications.
To use it, simply start a transaction and put this line before any queries you run.
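For example, a rough sketch from Python (pymysql and the connection/table names are assumptions; any SQL client works the same way by issuing the SET statement before the query):

    import pymysql

    PAGE, PAGE_SIZE = 0, 50000          # stand-ins for the PAGE/OFFSET values in the script above

    conn = pymysql.connect(host="aurora-replica", user="report",
                           password="secret", database="mydb")
    try:
        with conn.cursor() as cur:
            cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")   # dirty reads from here on
            cur.execute("SELECT url, data FROM my_table LIMIT %s OFFSET %s",
                        (PAGE_SIZE, PAGE * PAGE_SIZE))
            rows = cur.fetchall()
    finally:
        conn.close()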
The 'choking'
What you are seeing when your colleague runs a large query is record locking. Your database engine is - quite correctly - set up to provide an accurate view of your data, at any point. So, when a large query comes along the database engine first waits for all write locks (transactions) to clear, runs the large query and holds all future write locks until the query has run.
This actually happens for all transactions, but you only really notice it for the big ones.
What READ UNCOMMITTED does
By setting the transaction isolation level to READ UNCOMMITTED, you are telling the database engine that this transaction doesn't care about write locks and to go ahead and read anyway.
This is known as a 'dirty read', in that the long-running query could well read a table with a write lock on it and will ignore the lock. The data actually read could be the data before the write transaction has completed, or a different transaction could start and modify records before this query gets to it.
The data returned from anything with READ UNCOMMITTED is not guaranteed to be correct in the ACID sense of a database engine, but for some use cases it is good enough.
What the effect is
Your large queries magically run faster and don't lock the database while they are running.
Use with caution and understand what it does before you use it though.
MySQL Manual on transaction isolation levels
We are trying to add test coverage to an old, big project which has more than 500 tables in the database, and creating the test database and running migrations waste too much time (more than an hour on my RMBP).
We are using PostgreSQL as the database because some GIS-related services need it, so it's hard to replace it with SQLite.
What can I do to decrease the time spent on test preparation?
You can use django-nose and reuse the database like this:
REUSE_DB=1 ./manage.py test
Be careful that your tests do not leave any junk in the DB. Have a look at the documentation for more info.
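For reference, the settings django-nose expects, per its documentation:

    # settings.py
    INSTALLED_APPS = [
        # ... your existing apps ...
        'django_nose',
    ]
    TEST_RUNNER = 'django_nose.NoseTestSuiteRunner'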
At some point I ended up creating a transaction management middleware that would intercept transaction calls so that all tests were run in a transaction, and then the transaction was rolled back at the end.
Another alternative is to have a binary database dump that gets loaded at the beginning of each test, with the database dropped and recreated between tests. After creating a good database, take a binary dump of it (since you are on PostgreSQL, use pg_dump in custom format; xtrabackup is the equivalent tool on the MySQL side). Then, in a per-test setup function, drop and recreate the database and load the dump. Because it's a binary dump, it loads fairly quickly.
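A rough sketch of that per-test setup using PostgreSQL's own tools (the database name and dump path are placeholders):

    import subprocess

    DB = "test_db"
    DUMP = "/tmp/test_db.dump"   # created once with: pg_dump -Fc test_db -f /tmp/test_db.dump

    def reset_test_database():
        """Drop, recreate, and reload the test database from the binary dump."""
        subprocess.run(["dropdb", "--if-exists", DB], check=True)
        subprocess.run(["createdb", DB], check=True)
        subprocess.run(["pg_restore", "--dbname", DB, DUMP], check=True)

    # call reset_test_database() from your per-test (or per-test-class) setup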
I would like to optimize my system to be able to handle a large number of users down the road. Even if the website never becomes popular, I want to do things right.
Anyway, I am currently using a combo of 2 database solutions:
1.) Either SQL (MySQL, PostgreSQL) via SQLAlchemy, or MongoDB
2.) Redis
I use Redis as the 'hot' database (as it's much, much faster and takes stress off the primary database), and then sync data between the two via cron tasks. I use Redis for session management, statistics, etc. However, if my Redis server crashed, the site would remain operational (falling back to SQL/Mongo).
So this is my design for data. Now I would like to do proper connecting.
As both sql/mongo and redis are required on 99% of pages, my current design is the following:
- When new HTTP request comes in, I connect to all databases
- When page finishes rendering, I disconnect from databases
Now obviously I am doing a lot of connecting/disconnecting. I've calculated that this model could sustain a decent number of visitors; however, I am wondering whether there is a better way to do this.
Would persisting connections between requests improve performance/load or would the sheer amount of open connections clog the server?
Would you recommend creating a connection pool? If so, when should the connection pool be created, and how should the Model access it (or fetch connection objects from it)?
I am sorry if these questions are stupid, but I am a newbie.
I don't think it's a good idea to optimize things beforehand. You don't know where the bottlenecks will appear, and you are probably just wasting time on things you mostly won't need in the future.
The database type can be changed later if you use an ORM, so right now you can use any of them. Anyway, if your site's popularity rises significantly, you will need to get more servers, add some task queues (Celery), etc. There are tons of things you can do later to optimize. Right now you should just focus on making your site popular and use technologies that can scale in the future.
If you are going to leave connections open, you should definitely consider pooling to avoid cluttering the system with per-session connections or the like (as long as they are locked properly to avoid leaks). That said, the necessity of doing this isn't clear. If you can quantify the system with some average/worst-case connection times to the databases, you'll be able to make a much more informed decision.
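As a rough sketch of what pooled, persistent connections could look like (SQLAlchemy's built-in engine pool plus a redis-py connection pool; the pool sizes and URLs are placeholders to tune against your own measurements):

    import redis
    from sqlalchemy import create_engine, text

    # created once per process, reused by every request
    engine = create_engine(
        "mysql+pymysql://user:password@localhost/mydb",   # placeholder URL
        pool_size=10,        # persistent connections kept open
        max_overflow=5,      # extra connections allowed under bursts
        pool_recycle=3600,   # recycle connections older than an hour
    )
    redis_pool = redis.ConnectionPool(host="localhost", port=6379, db=0)

    def handle_request():
        # per request: borrow connections from the pools instead of reconnecting
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))                # stand-in for the real per-request queries
        r = redis.Redis(connection_pool=redis_pool)
        r.incr("page_views")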
Try running a script (or scripts) to hammer your system and investigate DB-related timing. This should help you make an immediate decision about whether to keep persistent connections, and leave you with a handy DB load script for later on.