How to create summary statistics for an entire SQLite database? - python

Consider some SQLite database.db, with a large number of tables and columns.
Pandas' .describe() produces the summary statistics that I want (see below). However, it requires reading each table in full, which is a problem for large databases. Is there an (SQL or Python) alternative that is less memory hungry? Specifying column names manually is not feasible here.
import pandas as pd
import sqlite3
con = sqlite3.connect("file:database.db", uri=True)
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", con)
columns = []
for _, row in tables.iterrows():
    col = pd.read_sql(f"PRAGMA table_info({row['name']})", con)
    col['table'] = row['name']
    stats = pd.read_sql(f"""SELECT * FROM {row['name']}""", con)
    stats = stats.describe(include='all')
    stats = stats.transpose()
    col = col.merge(stats, left_on='name', right_index=True)
    columns.append(col)
columns = pd.concat(columns)

Perhaps a little over the top, but you could use TRIGGERS to maintain statistics and eliminate the need for full table scans. There will obviously be some overhead for maintaining the statistics, but that overhead is distributed over time.
Perhaps consider the following demo (in SQL) where there are two main tables, tablea and tablex (it could be any number of tables), plus another table called statistic which is used to dynamically store statistics.
For each main table 3 triggers are created: one for when a row is inserted, one for when a row is updated and one for when a row is deleted. So 6 triggers in all for the 2 main tables.
The statistic table has 5 columns:
tablename - the primary key; holds the name of the table that the row stores statistics about
row_count - the number of rows (in theory); the insert trigger for the respective table increments the row_count and the delete trigger decrements it
insert_count - the insert trigger increments the insert_count
update_count - the update trigger increments the update_count
delete_count - the delete trigger increments the delete_count
All of the triggers first try to insert the respective row for the table, with all values using the default of 0. As the tablename is the primary key, the INSERT OR IGNORE ensures that the row is only added once (unless the row is deleted, which effectively resets the stats for the table).
The demo includes some insertions, deletions and updates and finally extraction of the statistics:-
DROP TABLE IF EXISTS tablea;
DROP TABLE IF EXISTS tablex;
DROP TABLE IF EXISTS statistic;
CREATE TABLE IF NOT EXISTS statistic (
tablename TEXT PRIMARY KEY,
row_count INTEGER DEFAULT 0,
insert_count INTEGER DEFAULT 0,
update_count INTEGER DEFAULT 0,
delete_count INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS tablea (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER IF NOT EXISTS tablea_after_ins AFTER INSERT ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET row_count = row_count +1, insert_count = insert_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TRIGGER IF NOT EXISTS tablea_after_update AFTER UPDATE ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TRIGGER IF NOT EXISTS tablea_after_delete AFTER DELETE ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET row_count = row_count -1, delete_count = delete_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TABLE IF NOT EXISTS tablex (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER IF NOT EXISTS tablex_after_ins AFTER INSERT ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET row_count = row_count +1, insert_count = insert_count + 1 WHERE tablename = 'tablex';
END
;
CREATE TRIGGER IF NOT EXISTS tablex_after_update AFTER UPDATE ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablex';
END
;
CREATE TRIGGER IF NOT EXISTS tablex_after_delete AFTER DELETE ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET row_count = row_count -1, delete_count = delete_count + 1 WHERE tablename = 'tablex';
END
;
INSERT INTO tablea (data1) VALUES('a');
INSERT INTO tablea (data1) VALUES('b'),('c'),('d'),('z');
DELETE FROM tablea WHERE data1 LIKE 'z';
UPDATE tablea set data1 = 'letter_'||data1;
DELETE FROM tablea WHERE data1 LIKE '%_c';
INSERT OR IGNORE INTO tablex (data1) VALUES
('1a'),('2a'),('3a'),('4a'),('5a')
,('1b'),('2b'),('3b'),('4b'),('5b')
,('1c'),('2c'),('3c'),('4c'),('5c')
,('1d'),('2d'),('3d'),('4d'),('5d')
;
SELECT * FROM statistic;
/* Cleanup the demo environment */
DROP TABLE IF EXISTS tablea;
DROP TABLE IF EXISTS tablex;
DROP TABLE IF EXISTS statistic;
When run, the final SELECT shows the accumulated counts in the statistic table.
Note that the mass insert into tablex records all 20 rows added (i.e. the trigger fires for every inserted row, and the triggering is part of the transaction).
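A minimal sketch of reading those maintained statistics from Python (assuming the trigger setup and the statistic table above have been added to database.db), so no full table scans are needed:
import sqlite3
import pandas as pd

con = sqlite3.connect("file:database.db", uri=True)
# one small query per database instead of SELECT * on every table
stats = pd.read_sql("SELECT tablename, row_count, insert_count, update_count, delete_count FROM statistic", con)
con.close()
print(stats)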

Related

Insert unmatched records to another database connection

I have 2 database connections. I want to compare a single table from each connection to the other, and if there are unmatched records, I want to add them to the table in the database where they are missing.
This is what I came up with, but it doesn't seem to do the inserting part. I'm new to Python, so excuse the code. Thanks.
# establishing connections and querying the database
import sqlite3
con1 = sqlite3.connect("database1.db")
cur1 = con1.cursor()
table1 = cur1.execute("SELECT * FROM table1")
fetch_table1 = table1.fetchall()
mylist = list(table1)
con2 = sqlite3.connect("database2.db")
cur2 = con2.cursor()
table2 = cur2.execute("SELECT * FROM table2")
table2 = table2.fetchall()
mylist2 = list(table2)
# finding unmatched elements and inserting them into the database
def non_match_elements(mylist2, mylist):
    non_match = []
    for i in mylist2:
        if i not in mylist:
            non_match.append(i)
non_match = non_match_elements(mylist2, mylist)
cur1.executemany("""INSERT INTO table 1 VALUES (?,?,?)""", non_match)
con1.commit()
res = cur1.execute("select column from table1")
print(res.fetchall())
Thanks again guys
I would suggest ATTACHing one database to the other connection; you can then use two INSERT INTO table1 SELECT * FROM table2 WHERE ... queries to insert from one table into the other.
Here's an example/demo (not of the ATTACH DATABASE itself, but of aligning two tables with the same schema but with different data):-
/* Cleanup - just in case*/
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
/* Create the two tables */
CREATE TABLE IF NOT EXISTS table1 (val1 TEXT, val2 TEXT, val3 TEXT);
CREATE TABLE IF NOT EXISTS table2 (val1 TEXT, val2 TEXT, val3 TEXT);
/*****************************************************************/
/* load the two different sets of data and also some common data */
INSERT INTO table1 VALUES ('A','AA','AAA'),('B','BB','BBB'),('C','CC','CCC'),('M','MM','MMM');
INSERT INTO table2 VALUES ('X','XX','XXX'),('Y','YY','YYY'),('Z','ZZ','ZZZ'),('M','MM','MMM');
/*************************************************************/
/* Match each table to the other using an INSERT .... SELECT */
/*************************************************************/
INSERT INTO table1 SELECT * FROM table2 WHERE val1||val2||val3 NOT IN (SELECT(val1||val2||val3) FROM table1);
INSERT INTO table2 SELECT * FROM table1 WHERE val1||val2||val3 NOT IN (SELECT(val1||val2||val3) FROM table2);
/* Output both tables */
SELECT 'T1',* FROM table1;
SELECT 'T2',* FROM table2;
/* Cleanup */
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
The results of the 2 SELECTs show the full contents of both tables, the first column (T1 or T2) just being used to indicate which table the SELECT is from:
table1 has the X, Y and Z value rows copied from table2
table2 has the A, B and C value rows copied from table1
the M value row, as it exists in both, remains intact; it is neither duplicated nor deleted
Thus, data wise, the two tables are identical.
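To apply that to the original two-connection scenario, here's a minimal sketch of the ATTACH DATABASE route (assuming, as in the demo above, that both tables share the val1/val2/val3 schema and that concatenating the columns is an acceptable row key):
import sqlite3

con1 = sqlite3.connect("database1.db")
# make database2.db visible on the same connection under the schema name db2
con1.execute("ATTACH DATABASE 'database2.db' AS db2")

# copy rows present in db2.table2 but missing from main.table1
con1.execute("""INSERT INTO main.table1
                SELECT * FROM db2.table2
                WHERE val1||val2||val3 NOT IN (SELECT val1||val2||val3 FROM main.table1)""")

# and the other direction, so both tables end up with the same data
con1.execute("""INSERT INTO db2.table2
                SELECT * FROM main.table1
                WHERE val1||val2||val3 NOT IN (SELECT val1||val2||val3 FROM db2.table2)""")

con1.commit()
con1.execute("DETACH DATABASE db2")
con1.close()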

Executing an SQL update statement from a pandas dataframe

Context: I am using MSSQL, pandas, and pyodbc.
Steps:
Obtain dataframe from query using pyodbc (no problemo)
Process columns to generate the content of a new (but already existing) column
Fill an auxiliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t WHERE t.ID = df.ID)
Now how do I execute the SQL code in the auxiliary column, without looping through each row?
sample data
The first two columns are obtained by querying dbo.table; the third column exists but is empty in the database. The fourth column only exists in the dataframe, to prepare the SQL statement that would correspond to updating dbo.table.
ID | raw                  | processed   | strSQL
---+----------------------+-------------+--------------------------------------------------------------------------
 1 | lorum.ipsum@test.com | lorum ipsum | UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
 2 | rumlo.sumip@test.com | rumlo sumip | UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
 3 | ...                  | ...         | ...
I would like to execute the SQL script in each row in an efficient manner.
After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.
For an existing table named MillionRows
ID TextField
-- ---------
1 foo
2 bar
3 baz
…
and example data of the form
num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]
my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True
crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
took about 180 seconds (3 minutes).
However, by creating a user-defined table type
CREATE TYPE dbo.TextField_ID AS TABLE
(
TextField nvarchar(255) NULL,
ID int NOT NULL,
PRIMARY KEY (ID)
)
and a stored procedure
CREATE PROCEDURE [dbo].[mr_update]
@tbl dbo.TextField_ID READONLY
AS
BEGIN
SET NOCOUNT ON;
UPDATE MillionRows SET TextField = t.TextField
FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END
when I used
crsr.execute("{CALL mr_update (?)}", (rows,))
it did the same update in approximately 80 seconds (less than half the time).
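For completeness, tying the plain executemany() route back to the dataframe in the question, a minimal sketch could look like the following (the connection string and the assumption that df holds ID and processed columns for dbo.table are placeholders, not the poster's actual setup):
import pyodbc

# df: the dataframe obtained earlier from the query; conn_str: placeholder connection string
cnxn = pyodbc.connect(conn_str, autocommit=False)
crsr = cnxn.cursor()
crsr.fast_executemany = True

# build (value, key) parameter tuples straight from the dataframe,
# instead of generating one UPDATE string per row
rows = list(df[["processed", "ID"]].itertuples(index=False, name=None))

crsr.executemany("UPDATE dbo.[table] SET processed = ? WHERE ID = ?", rows)
cnxn.commit()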

Update Firebird table with data from csv Python

Let's say we have a CSV file with payment information like this:
somthing1:500.00
somthing2:300.00
somthing3:200.00
I need to update different rows in tableB in a Firebird database, which has a column called paid, with the values in the second column of the CSV file. For that reason I created a global temporary table just to pump the data into it and then update the desired table. What I tried is this:
idd = 0
with open('file.csv', 'r', encoding='utf-8', newline='') as file:
    reader = list(csv.reader(file, delimiter=':'))
    for row in reader:
        idd += 1
        c.execute("""INSERT INTO temp_table (id,code,paid) VALUES (?,?,?)""",
                  (idd, str(row[0]), str(row[1]), ))
This works as expected - the table is populated. After that I tried to update the other table like this:
c.execute("""select paid from temp_table;""")
res = c.fetchall()
for r in res:
c.execute(f"""UPDATE tableB SET paid = {r[0]} WHERE oid = 10;""")
This does run, but for every result from the SELECT query it updates all of the matching rows with that same value, so they all end up holding the last value. I tried MERGE in the database itself with the same result - every row is updated with the same value from temp_table:
MERGE INTO tableB b
USING (SELECT paid FROM temp_table) e
ON b.paid = 0 AND oid = 10
WHEN MATCHED THEN
UPDATE SET b.paid = e.paid;
I need some way to update the first row of tableB with the first row from the CSV, and so on. What am I missing?
The problem is that you are not correlating rows from temp_table with rows in tableB.
In your initial attempt, you update all rows in tableB with oid = 10. Similarly, in your second attempt you update all rows with paid = 0 without correlating them to temp_table.
You need to add a condition to match rows between the tables; say both have an id column:
MERGE INTO tableB b
USING (SELECT id, paid FROM temp_table) e
ON b.paid = 0 AND b.oid = 10 AND b.id = e.id
WHEN MATCHED THEN
UPDATE SET b.paid = e.paid;
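As a rough sketch of that flow in Python (reusing the cursor c from the question, and assuming tableB really does have an id column that lines up with the row numbers loaded into temp_table):
import csv

# load the CSV into the temporary table, numbering the rows
idd = 0
with open('file.csv', 'r', encoding='utf-8', newline='') as file:
    for row in csv.reader(file, delimiter=':'):
        idd += 1
        c.execute("INSERT INTO temp_table (id, code, paid) VALUES (?, ?, ?)",
                  (idd, row[0], row[1]))

# one correlated MERGE instead of a Python loop of UPDATEs
c.execute("""MERGE INTO tableB b
             USING (SELECT id, paid FROM temp_table) e
             ON b.paid = 0 AND b.oid = 10 AND b.id = e.id
             WHEN MATCHED THEN
                 UPDATE SET b.paid = e.paid""")
con.commit()  # con: the Firebird connection that produced cursor c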

automatically insert only one row in table after calculating the sum of column

I have 3 tables in my database:
CREATE TABLE IF NOT EXISTS depances (
id SERIAL PRIMARY KEY UNIQUE NOT NULL,
type VARCHAR NOT NULL,
nom VARCHAR,
montant DECIMAL(100,2) NOT NULL,
date DATE,
temp TIME)
CREATE TABLE IF NOT EXISTS transactions (
id SERIAL PRIMARY KEY UNIQUE NOT NULL,
montant DECIMAL(100,2),
medecin VARCHAR,
patient VARCHAR,
acte VARCHAR,
date_d DATE,
time_d TIME,
users_id INTEGER)
CREATE TABLE IF NOT EXISTS total_jr (
id SERIAL PRIMARY KEY UNIQUE NOT NULL,
total_revenu DECIMAL(100,2),
total_depance DECIMAL(100,2),
total_différence DECIMAL(100,2),
date DATE)
My idea is to insert different values into the tables depances and transactions using a GUI interface,
and after that to add the SUM of depances.montant into total_jr.total_depance
and the SUM of transactions.montant into total_jr.total_revenu, where all the summed rows have the same date.
That's the easy part, using this code:
self.cur.execute('''SELECT SUM(montant) AS totalsum FROM depances WHERE date = %s''', (date,))
result = self.cur.fetchall()
for i in result:
    o = i[0]
    self.cur_t = self.connection.cursor()
    self.cur_t.execute('''INSERT INTO total_jr(total_depance)
                          VALUES (%s)''', (o,))
    self.connection.commit()
    self.cur.execute('''UPDATE total_jr SET total_depance = %s WHERE date = %s''', (o, date))
    self.connection.commit()
But every time it adds a new row to the total_jr table.
How can I add the value of SUM(montant) to the row where the date is the same, so that it only puts the value of the sum into one row instead of adding a new row each time?
The result should be like this:
id|total_revenu |total_depance|total_différence|date
--+-------------+-------------+----------------+----------
1 |sum(montant1)|value        |value           |08/07/2020
2 |sum(montant2)|value        |value           |08/09/2020
3 |sum(montant3)|value        |value           |08/10/2020
but it only gives me this result:
id|total_revenu|total_depance|total_différence|date
--+------------+-------------+----------------+----------
1 |1           |value        |value           |08/07/2020
2 |2           |value        |value           |08/07/2020
3 |3           |value        |value           |08/07/2020
If there is any idea or any hint, that would be helpful.
You didn't mention which DBMS or SQL module you're using so I'm guessing MySQL.
In your process, run the update first and check how many rows were changed. If zero rows changed, then insert a new row for that date.
self.cur.execute('''SELECT SUM(montant) AS totalsum FROM depances WHERE date = %s''', (date,))
result = self.cur.fetchall()
for i in result:
    o = i[0]
    self.cur.execute('''UPDATE total_jr SET total_depance = %s WHERE date = %s''', (o, date))
    rowcnt = self.cur.rowcount  # number of rows updated - psycopg2
    self.connection.commit()
    if rowcnt == 0:  # no rows updated, need to insert new row
        self.cur_t = self.connection.cursor()
        self.cur_t.execute('''INSERT INTO total_jr(total_depance, date)
                              VALUES (%s, %s)''', (o, date))
        self.connection.commit()
I found a solution, for anyone who needs it in the future. First of all we need to update the table:
create_table_total_jr = ''' CREATE TABLE IF NOT EXISTS total_jr (
id SERIAL PRIMARY KEY UNIQUE NOT NULL,
total_revenu DECIMAL(100,2),
total_depance DECIMAL(100,2),
total_différence DECIMAL(100,2),
date DATE UNIQUE)''' #add unique to the date
and after that we use an UPSERT with ON CONFLICT:
self.cur_t.execute( ''' INSERT INTO total_jr(date) VALUES (%s)
ON CONFLICT (date) DO NOTHING''', (date,))
self.connection.commit()
With this code, when there is an attempt to insert a row with the same date, it will do nothing.
After that we update the value of the SUM:
self.cur.execute( '''UPDATE total_jr SET total_depance = %s WHERE date = %s''',(o, date))
self.connection.commit()
Special thanks to Mike67 for his help
You do not need 2 database calls for this. As @Mike67 suggested, UPSERT functionality is what you want. However, you need to send both date and total_depance. In SQL that becomes:
insert into total_jr(date, total_depance)
values (date_value, total_value)
on conflict (date)
do update
set total_depance = excluded.total_depance;
or, if the input total_depance is just the transaction value while total_depance on the table is an accumulation:
insert into total_jr(date, total_depance)
values (date_value, total_value)
on conflict (date)
do update
set total_depance = total_jr.total_depance + excluded.total_depance;
I believe your code then becomes something like (assuming the 1st insert is correct):
self.cur_t.execute('''INSERT INTO total_jr(date, total_depance) VALUES (%s, %s)
                      ON CONFLICT (date) DO UPDATE SET total_depance = excluded.total_depance''',
                   (date, total_depance))
self.connection.commit()
But that could be off; you will need to verify.
Tip of the day: You should change the column name date to something else. date is a reserved word in both Postgres and the SQL Standard, and it has predefined meanings based on its context. While you may get away with using it as a column name, Postgres still has the right to change that at any time without notice; unlikely, but still true. If so, then your code (and most code using that/those table(s)) fails, and tracking down why becomes extremely difficult. Basic rule: do not use reserved words as data names; using reserved words as data or db object names is a bug just waiting to bite.

Load bulk into PostgreSQL from log files using Python

This is a follow-up question. Below is a piece of my Python script that reads constantly growing log files (text) and inserts data into a PostgreSQL DB. A new log file is generated each day. What I do is commit each line, which causes a huge load and really poor performance (it needs 4 hours to insert 30 min of the file data!). How can I improve this code to insert bulks instead of lines? And would this help improve the performance and reduce load? I've read about copy_from but couldn't figure out how to use it in such a situation.
import psycopg2 as psycopg
try:
    connectStr = "dbname='postgis20' user='postgres' password='' host='localhost'"
    cx = psycopg.connect(connectStr)
    cu = cx.cursor()
    logging.info("connected to DB")
except:
    logging.error("could not connect to the database")

import time
file = open('textfile.log', 'r')
while 1:
    where = file.tell()
    line = file.readline()
    if not line:
        time.sleep(1)
        file.seek(where)
    else:
        print line,  # already has newline
        dodecode(line)
------------
def dodecode(fields):
    global cx
    from time import strftime, gmtime
    from calendar import timegm
    import os
    msg = fields.split(',')
    part = eval(msg[2])
    msgnum = int(msg[3:6])
    print "message#:", msgnum
    print fields
    if (part==1):
        if msgnum==1:
            msg1 = msg_1.decode(bv)
            #print "message1 :",msg1
            Insert(msgnum,time,msg1)
        elif msgnum==2:
            msg2 = msg_2.decode(bv)
            #print "message2 :",msg2
            Insert(msgnum,time,msg2)
        elif msgnum==3:
            ....
            ....
            ....
----------------
def Insert(msgnum,time,msg):
    global cx
    try:
        if msgnum in [1,2,3]:
            if msg['type']==0:
                cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']), '"+text+"')+" "+str(float(msg['latitude']))+")']))+" WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"';")
                cu.execute("INSERT INTO table2 ( field1,field2,field3, time_stamp, pos,) SELECT "+str(msg['UserID'])+","+str(int(msg['UserName']))+","+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")')," WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
                cu.execute("Update table2 SET field3='"+str(int(msg['UserIO']))+"',time_stamp='"+str(time)+"',pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'),"' WHERE field1='"+str(msg['UserID'])+"' AND time_stamp < '"+str(time)+"';")
            elif msg['type']==1:
                cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']), '"+text+"')+" "+str(float(msg['latitude']))+")']))+" WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"';")
                cu.execute("INSERT INTO table2 ( field1,field2,field3, time_stamp, pos,) SELECT "+str(msg['UserID'])+","+str(int(msg['UserName']))+","+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")')," WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
                cu.execute("Update table2 SET field3='"+str(int(msg['UserIO']))+"',time_stamp='"+str(time)+"',pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'),"' WHERE field1='"+str(msg['UserID'])+"' AND time_stamp < '"+str(time)+"';")
            elif msg['type']==2:
                ....
                ....
                ....
    except Exception, err:
        #print('ERROR: %s\n' % str(err))
        logging.error('ERROR: %s\n' % str(err))
        cx.commit()
    cx.commit()
Doing multiple rows per transaction, and per query, will make it go faster.
When faced with a similar problem I put multiple rows in the values part of the insert query, but you have complicated insert queries, so you'll likely need a different approach.
I'd suggest creating a temporary table and inserting say 10000 rows into it with ordinary multi-row inserts
insert into temptable values ( /* row1 data */ ) ,( /* row2 data */ ) etc...
500 rows per insert is a good starting point.
then joining the temp table with the existing data to de-dupe it.
delete from temptable using livetable where /* join condition */ ;
and de-duping it against itself if that is needed too
delete from temptable where id not in
( select distinct on ( /* unique columns */) id from temptable);
then using insert-select to copy the rows from the temporary table into the live table
insert into livetable ( /* columns */ )
select /* columns */ from temptable;
it looks like you might need an update-from too
and finally dropping the temp table and starting again.
and as you're writing to two tables, you're going to need to double up all these operations.
I'd do the insert by maintaining a count and a list of values to insert, and then at insert time
building the query by repeating the (%s,%s,%s,%s) part as many times as needed, passing the list of values in separately and letting psycopg2 deal with the formatting.
I'd expect making those changes could get you a speed-up of 5 times or more.
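As a rough sketch of that last point (a minimal illustration only: the table names temptable/livetable, the col1..col4 columns, the join column and parse_line are placeholders, not the poster's actual schema, and the batch size is just the suggested starting value):
import psycopg2

def flush_batch(cur, rows):
    # rows is a list of 4-tuples; build one multi-row INSERT for the whole batch
    if not rows:
        return
    placeholders = ",".join(["(%s,%s,%s,%s)"] * len(rows))
    flat = [value for row in rows for value in row]
    cur.execute("INSERT INTO temptable (col1, col2, col3, col4) VALUES " + placeholders, flat)

cx = psycopg2.connect("dbname='postgis20' user='postgres' host='localhost'")
cu = cx.cursor()

batch = []
for line in open('textfile.log'):
    batch.append(parse_line(line))   # parse_line: placeholder for the decoding logic above
    if len(batch) >= 500:            # ~500 rows per INSERT as suggested
        flush_batch(cu, batch)
        batch = []
flush_batch(cu, batch)               # flush the remainder

# de-dupe against the live table, copy across, then clear the temp table
cu.execute("DELETE FROM temptable USING livetable WHERE temptable.col1 = livetable.col1")
cu.execute("INSERT INTO livetable (col1, col2, col3, col4) SELECT col1, col2, col3, col4 FROM temptable")
cu.execute("TRUNCATE temptable")
cx.commit()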
