I am using an embedded MonetDB database in Python via MonetDBe.
I can see how to create a new connection with the :memory: setting,
but I can't see a way to persist the created database and tables for later use.
Once an in-memory session ends, all data is lost.
So I have two questions:
Is there a way to persist an in-memory db to local disk?
and
Once an in-memory db has been saved to local disk, is it possible to load the db back into memory at a later point to allow fast data analytics? At the moment it looks like if I create a connection from a file location, my queries read from local disk rather than from memory.
It is a little bit hidden away admittedly, but you can check out the following code snippet from the movies.py example in the monetdbe-examples repository:
import monetdbe

database = '/tmp/movies.mdbe'

with monetdbe.connect(database) as conn:
    conn.set_autocommit(True)
    conn.execute(
        """CREATE TABLE Movies
        (id SERIAL, title TEXT NOT NULL, "year" INTEGER NOT NULL)""")
So in this example the single argument to connect is just the desired path to your database directory. This is how you can (re)start a database that stores its data persistently on the file system.
Notice that I have intentionally removed the Python lines from the example in the actual repo that start with the comment # Removes the database if it already exists, just to make the example in this answer persistent.
I haven't run the code, but I expect that if you run it twice consecutively the second run will return a database error on the execute statement, as the Movies table should already be there.
And just to be sure, don't use the /tmp directory if you want your data to persist between restarts of your computer.
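To answer the second part of the original question: once the directory exists on disk, a later session can simply connect to the same path again and query it. A minimal, untested sketch of such a follow-up session (the path below is a placeholder, and the Movies table is assumed to have been created by the earlier run):

import monetdbe

# Hypothetical later session: reopen the directory that the earlier run created.
database = '/home/user/movies.mdbe'  # a non-volatile path, unlike /tmp

with monetdbe.connect(database) as conn:
    cur = conn.cursor()
    # The tables created in the previous session are still there.
    cur.execute('SELECT id, title, "year" FROM Movies')
    print(cur.fetchall())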
Currently I am trying to use pg_dump and pg_restore to dump select rows from a production server to a testing server. The goal is to have a testing server and database that contain the selected subset of data; moreover, through a Python script, I want the ability to restore the database to that original subset after testing and potentially modifying its contents.
From my understanding of pg_dump and pg_restore, the databases they interact with must have the same dbname. Moreover, the selection criteria should be applied with the COPY command. Hence, my idea is to have two databases on my production server, one with the large set of data and one with the selected set. Then, name the smaller db 'test' and restore it to the 'test' db on the test server.
Is there a better way to do this, considering I don't want to keep the secondary db on my production server and will potentially need to make changes to the selected subset in the future?
From my understanding of pg_dump and pg_restore, the databases that they interact with must be of the same dbname.
The databases being worked with only have to have the same name if you are using --create. Otherwise each program operates in whatever database was specified when it was invoked, which can be different.
The rest of your question is too vague to be addressable. Maybe pg_dump/pg_restore are the wrong tools for this, and just using COPY...TO and COPY...FROM would be more suitable.
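For illustration, here is a rough Python sketch of that COPY ... TO / COPY ... FROM route using psycopg2; the connection strings, table name, and selection query below are placeholders, not something from the original question:

import psycopg2

# Hypothetical connection strings and subset query, for illustration only.
PROD_DSN = "dbname=prod host=prod-server user=me"
TEST_DSN = "dbname=test host=test-server user=me"
SUBSET_SQL = "COPY (SELECT * FROM orders WHERE region = 'EU') TO STDOUT WITH CSV"

# Dump the selected rows from production to a local CSV file.
with psycopg2.connect(PROD_DSN) as prod, open("subset.csv", "w") as f:
    with prod.cursor() as cur:
        cur.copy_expert(SUBSET_SQL, f)

# Reset the test table and load the subset into the test server.
with psycopg2.connect(TEST_DSN) as test, open("subset.csv") as f:
    with test.cursor() as cur:
        cur.execute("TRUNCATE orders")
        cur.copy_expert("COPY orders FROM STDIN WITH CSV", f)

Because the subset lives in a plain file, the same load step can be re-run from a Python script whenever the test database needs to be reset to the original selection.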
I am using Apache Airflow in my project. In it, users can connect their database to our project and copy their tables to our database.
So I am able to establish a connection using the following lines:
import json

from airflow.models.connection import Connection
from airflow.providers.mysql.hooks.mysql import MySqlHook

c = Connection(
    conn_id='some_conn',
    conn_type='mysql',
    description='connection description',
    host='myhost.com',
    login='myname',
    schema='myschema',
    password='mypassword',
    extra=json.dumps(dict(this_param='some val', that_param='other val*')),
)

# Print the connection as an AIRFLOW_CONN_* environment variable definition.
print(f"AIRFLOW_CONN_{c.conn_id.upper()}='{c.get_uri()}'")

hook = MySqlHook(mysql_conn_id=c.conn_id)
result = hook.get_records(
    f"SELECT table_name FROM information_schema.tables WHERE table_schema = '{c.schema}';"
)
Now I am able to get the table names in the connected database.
How can I copy data from this connected database to our database? Please help me with some hints on this.
This depends on what databases you want to copy data between.
A straightforward approach could be outlined as the following steps.
Grab the records from Database A.
Insert the records into Database B.
You would create a custom operator that performs those steps in order. There may even be existing operators that fulfill these functions; I would advise you to take a look in the Airflow GitHub first.
Please note that this approach is not suitable for large datasets, because the data is held in memory during task execution. You can also write to disk, but that route then depends on the machine the Airflow worker runs on.
If the databases live on the same cluster/server, then a simple SQL script would work. A HiveOperator, for example, would be sufficient to move data with some INSERT INTO SQL commands.
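As a rough sketch of the two steps above, reusing the MySqlHook already shown in the question (the connection IDs and table name are placeholders, and this is not an official operator):

from airflow.providers.mysql.hooks.mysql import MySqlHook

def copy_table(source_conn_id: str, target_conn_id: str, table: str) -> None:
    # Step 1: grab the records from database A.
    source = MySqlHook(mysql_conn_id=source_conn_id)
    rows = source.get_records(f"SELECT * FROM {table}")

    # Step 2: insert the records into database B.
    # Everything is held in memory, so this only suits small tables.
    target = MySqlHook(mysql_conn_id=target_conn_id)
    target.insert_rows(table=table, rows=rows)

This function could then be called from a PythonOperator (or wrapped in a custom operator) inside a DAG, e.g. copy_table('some_conn', 'our_db_conn', 'customers').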
I spin up a Postgres container and its data path /var/lib/postgresql/data is mapped to a local directory using volumes. As soon as the container is up and the database is set up, the local path is populated with all the db data. I need to check programmatically (using Python) whether the local location holds proper Postgres db data. I need this to decide whether to create tables: I create them if the local directory is blank or contains invalid Postgres data, and I don't if it is valid. The reason I am trying to achieve this is that I want to hook up the local db created by postgres_container_1 to postgres_container_2.
If the file /var/lib/postgresql/data/PG_VERSION exists, then it's probably a valid data directory. This is the first thing Postgres will check when you try to start the server.
Of course, there are many, many other things required to make it a valid data directory - too many to check by yourself. If you need to be 100% sure, the only practical way is to start the Postgres server and try to connect to it.
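Since the question asks for a Python check, a minimal sketch of that PG_VERSION heuristic could look like this (the path is the volume mount from the question):

import os

DATA_DIR = "/var/lib/postgresql/data"

def looks_like_pg_data_dir(path=DATA_DIR):
    # Heuristic only: an empty or unrelated directory will not contain PG_VERSION.
    return os.path.isfile(os.path.join(path, "PG_VERSION"))

if looks_like_pg_data_dir():
    print("Existing Postgres data directory found; skipping table creation.")
else:
    print("Directory is empty or not a data directory; initializing schema.")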
We use SQLite databases to store the results coming out of data analysis pipelines. The database files sit on our high performance scale-out filestore which is connected to the same switch as our cluster nodes to ensure a good connection.
However, recently I've been having trouble querying the database via Python, particularly when many jobs are trying to query the database at once. I get error messages such as:
sqlite3.DatabaseError: malformed database schema (primary_digest_joint) - index primary_digest_joint already exists
or
sqlite3.DatabaseError: database disk image is malformed
Note that these jobs are only reading the database, not writing to it (nothing is writing to the database), which I thought should be fine with SQLite.
Generally if I stop the pipeline, I can access the database fine and it appears to be perfectly intact. If I restart the pipeline again a number of jobs will successfully complete before I get the error again.
Any idea why this is happening or what can be done to stop it? Is there any chance that the database is actually being damaged, even though it seems to be fine and running a PRAGMA integrity_check doesn't suggest anything is wrong?
I'm currently using SQLAlchemy with two distinct session objects. In one session, I am inserting rows into a MySQL database. In the other session, I am querying that database for the max row id. However, the second session is not reading the latest state of the database. If I query the database manually, I see the correct, higher max row id.
How can I force the second session to query the live database?
The first session needs to commit to flush changes to the database.
first_session.commit()
The Session holds all the objects in memory and flushes them to the database together (lazily, for efficiency). Thus the changes made by first_session are not visible to second_session, which is reading data from the database.
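A minimal, self-contained sketch of the pattern (SQLite and the Item model here are placeholders standing in for the MySQL setup in the question):

from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import declarative_base, sessionmaker

engine = create_engine("sqlite://")          # placeholder for the MySQL engine
Base = declarative_base()

class Item(Base):
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))

Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

first_session = Session()    # writer
second_session = Session()   # reader

first_session.add(Item(name="new row"))
first_session.commit()       # without this, the reader never sees the new row

# If the reader already has an open transaction, expire its state (or commit)
# so the next query goes back to the database instead of an old snapshot.
second_session.expire_all()
print(second_session.query(func.max(Item.id)).scalar())

The expire_all()/commit() on the reading side is likely also what the answer below ran into: ending the reader's transaction lets subsequent queries see rows committed after that transaction began.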
I had a similar problem; for some reason I had to commit both sessions, even the one that is only reading.
This might be a problem with my code though; I cannot use the same session because the code will run on different machines. Also, the SQLAlchemy documentation says that each session should be used by one thread only, although one reading and one writing should not be a problem.