I have 2 database connections. I want to compare a single table from each connection with the other, and if there are unmatched records, I want to add them to the table where they are missing.
This is what I came up with, but it doesn't seem to do the inserting part. I'm new to Python, so please excuse the code.
# establishing connections and querying the database
import sqlite3

con1 = sqlite3.connect("database1.db")
cur1 = con1.cursor()
table1 = cur1.execute("SELECT * FROM table1")
fetch_table1 = table1.fetchall()
mylist = list(table1)

con2 = sqlite3.connect("database2.db")
cur2 = con2.cursor()
table2 = cur2.execute("SELECT * FROM table2")
table2 = table2.fetchall()
mylist2 = list(table2)

# finding unmatched elements and inserting them into the database
def non_match_elements(mylist2, mylist):
    non_match = []
    for i in mylist2:
        if i not in mylist:
            non_match.append(i)

non_match = non_match_elements(mylist2, mylist)
cur1.executemany("""INSERT INTO table 1 VALUES (?,?,?)""", non_match)
con1.commit()
res = cur1.execute("select column from table1")
print(res.fetchall())
Thanks again guys
I would suggest ATTACHing one database to the other connection. You can then run two INSERT INTO table1 SELECT * FROM table2 WHERE ... queries, each inserting the rows missing from one table into the other.
Here's an example/demo (not of the ATTACH DATABASE, but of aligning two tables with the same schema but with different data):-
/* Cleanup - just in case*/
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
/* Create the two tables */
CREATE TABLE IF NOT EXISTS table1 (val1 TEXT, val2 TEXT, val3 TEXT);
CREATE TABLE IF NOT EXISTS table2 (val1 TEXT, val2 TEXT, val3 TEXT);
/*****************************************************************/
/* load the two different sets of data and also some common data */
INSERT INTO table1 VALUES ('A','AA','AAA'),('B','BB','BBB'),('C','CC','CCC'),('M','MM','MMM');
INSERT INTO table2 VALUES ('X','XX','XXX'),('Y','YY','YYY'),('Z','ZZ','ZZZ'),('M','MM','MMM');
/*************************************************************/
/* Match each table to the other using an INSERT .... SELECT */
/*************************************************************/
INSERT INTO table1 SELECT * FROM table2 WHERE val1||val2||val3 NOT IN (SELECT(val1||val2||val3) FROM table1);
INSERT INTO table2 SELECT * FROM table1 WHERE val1||val2||val3 NOT IN (SELECT(val1||val2||val3) FROM table2);
/* Output both tables */
SELECT 'T1',* FROM table1;
SELECT 'T2',* FROM table2;
/* Cleanup */
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
In the results of the 2 SELECTs, the first column (T1 or T2) is just used to indicate which table the SELECT is from:
table1 has the X, Y and Z rows copied from table2.
table2 has the A, B and C rows copied from table1.
The M row, as it exists in both tables, remains intact; it is neither duplicated nor deleted.
Thus, data-wise, the two tables are identical.
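In Python, the ATTACH suggestion might look like the sketch below. It is only a sketch under assumptions: the table names come from the question, both tables are assumed to have identical columns, and the helper name sync_tables is made up for illustration. It also swaps the concatenation comparison of the demo for EXCEPT, which compares whole rows and collapses duplicate rows in the source table. As a side note, in the posted script list(table1) is empty because fetchall() has already consumed the cursor, non_match_elements never returns its list, and "table 1" is not a valid table name.

```python
import sqlite3


def sync_tables(path1, path2):
    """Copy the rows missing from table1 or table2 into the other table."""
    con = sqlite3.connect(path1)
    try:
        # Make the second database file visible on this connection as "db2".
        con.execute("ATTACH DATABASE ? AS db2", (path2,))
        # Rows that are in table2 but not in table1 go into table1 ...
        con.execute(
            "INSERT INTO main.table1 "
            "SELECT * FROM db2.table2 EXCEPT SELECT * FROM main.table1"
        )
        # ... and rows that are in table1 but not in table2 go into table2.
        con.execute(
            "INSERT INTO db2.table2 "
            "SELECT * FROM main.table1 EXCEPT SELECT * FROM db2.table2"
        )
        con.commit()
    finally:
        con.close()
```

With the file names from the question this would be called as sync_tables("database1.db", "database2.db").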
Consider some SQLite database.db with a large number of tables and columns.
pandas' .describe() produces the summary statistics that I want (see below). However, it requires reading each table in full, which is a problem for large databases. Is there an (SQL or Python) alternative that is less memory hungry? Specifying column names manually is not feasible here.
import pandas as pd
import sqlite3

con = sqlite3.connect("file:database.db", uri=True)
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", con)
columns = []
for _, row in tables.iterrows():
    col = pd.read_sql(f"PRAGMA table_info({row['name']})", con)
    col['table'] = row['name']
    stats = pd.read_sql(f"""SELECT * FROM {row['name']}""", con)
    stats = stats.describe(include='all')
    stats = stats.transpose()
    col = col.merge(stats, left_on='name', right_index=True)
    columns.append(col)
columns = pd.concat(columns)
Perhaps a little over the top, but you could use TRIGGERs to maintain statistics and eliminate the need for full table scans. There will obviously be some overhead for maintaining the statistics, but that overhead is distributed over time.
Perhaps consider the following demo (in SQL), where there are two main tables, tablea and tablex (there could be any number of tables), and another table called statistic, which is used to dynamically store statistics.
For each main table 3 triggers are created: one for when a row is inserted, one for when a row is updated and one for when a row is deleted. So 6 triggers in all for the 2 main tables.
The statistic table has 5 columns:
tablename, which is the primary key and holds the name of the table the row stores statistics about.
row_count, for the number of rows (in theory). The insert trigger for the respective table increments the row_count and the delete trigger decrements it.
insert_count, which the insert trigger increments.
update_count, which the update trigger increments.
delete_count, which the delete trigger increments.
All of the triggers first try to insert the respective row for the table, with all counts using the default of 0. As tablename is the primary key, the INSERT OR IGNORE ensures that the row is only added once (unless the row is deleted, effectively resetting the stats for the table).
The demo includes some insertions, deletions and updates, and finally extraction of the statistics:-
DROP TABLE IF EXISTS tablea;
DROP TABLE IF EXISTS tablex;
DROP TABLE IF EXISTS statistic;
CREATE TABLE IF NOT EXISTS statistic (
tablename TEXT PRIMARY KEY,
row_count INTEGER DEFAULT 0,
insert_count INTEGER DEFAULT 0,
update_count INTEGER DEFAULT 0,
delete_count INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS tablea (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER IF NOT EXISTS tablea_after_ins AFTER INSERT ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET row_count = row_count +1, insert_count = insert_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TRIGGER IF NOT EXISTS tablea_after_update AFTER UPDATE ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TRIGGER IF NOT EXISTS tablea_after_delete AFTER DELETE ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET row_count = row_count -1, delete_count = delete_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TABLE IF NOT EXISTS tablex (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER IF NOT EXISTS tablex_after_ins AFTER INSERT ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET row_count = row_count +1, insert_count = insert_count + 1 WHERE tablename = 'tablex';
END
;
CREATE TRIGGER IF NOT EXISTS tablex_after_update AFTER UPDATE ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablex';
END
;
CREATE TRIGGER IF NOT EXISTS tablex_after_delete AFTER DELETE ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET row_count = row_count -1, delete_count = delete_count + 1 WHERE tablename = 'tablex';
END
;
INSERT INTO tablea (data1) VALUES('a');
INSERT INTO tablea (data1) VALUES('b'),('c'),('d'),('z');
DELETE FROM tablea WHERE data1 LIKE 'z';
UPDATE tablea set data1 = 'letter_'||data1;
DELETE FROM tablea WHERE data1 LIKE '%_c';
INSERT OR IGNORE INTO tablex (data1) VALUES
('1a'),('2a'),('3a'),('4a'),('5a')
,('1b'),('2b'),('3b'),('4b'),('5b')
,('1c'),('2c'),('3c'),('4c'),('5c')
,('1d'),('2d'),('3d'),('4d'),('5d')
;
SELECT * FROM statistic;
/* Cleanup the demo environment */
DROP TABLE IF EXISTS tablea;
DROP TABLE IF EXISTS tablex;
DROP TABLE IF EXISTS statistic;
When run, the result has one row of statistics per table. Note that the mass insert into tablex records all 20 rows added (i.e. the trigger is triggered for every insert, and the triggering is part of the transaction).
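Since the question mentions a large number of tables, writing the three triggers per table by hand gets tedious. A minimal Python sketch of generating them could look like this. The helper name add_statistic_triggers and the COUNTER_UPDATES mapping are made up for illustration, and the table name is pasted straight into the SQL text, so it is assumed to come from a trusted source such as sqlite_master.

```python
import sqlite3

# Which counters each kind of trigger bumps (mirrors the SQL demo).
COUNTER_UPDATES = {
    "INSERT": "row_count = row_count + 1, insert_count = insert_count + 1",
    "UPDATE": "update_count = update_count + 1",
    "DELETE": "row_count = row_count - 1, delete_count = delete_count + 1",
}


def add_statistic_triggers(con, table):
    """Create the insert/update/delete statistics triggers for one table."""
    con.execute("""
        CREATE TABLE IF NOT EXISTS statistic (
            tablename TEXT PRIMARY KEY,
            row_count INTEGER DEFAULT 0,
            insert_count INTEGER DEFAULT 0,
            update_count INTEGER DEFAULT 0,
            delete_count INTEGER DEFAULT 0
        )""")
    for event, assignments in COUNTER_UPDATES.items():
        # One trigger per event; each first makes sure the statistics row exists.
        con.execute(f"""
            CREATE TRIGGER IF NOT EXISTS {table}_after_{event.lower()}
            AFTER {event} ON {table}
            BEGIN
                INSERT OR IGNORE INTO statistic (tablename) VALUES ('{table}');
                UPDATE statistic SET {assignments} WHERE tablename = '{table}';
            END""")
```

Calling it once per table name returned by SELECT name FROM sqlite_master WHERE type='table' (skipping statistic itself) would cover the whole database. Rows that exist before the triggers are created are not counted.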
When I try to remove all tables with:
base.metadata.drop_all(engine)
I'm getting the following error:
ERROR:libdl.database_operations:Cannot drop table: (psycopg2.errors.DependentObjectsStillExist) cannot drop sequence <schema>.<sequence> because other objects depend on it
DETAIL: default for table <schema>.<table> column id depends on sequence <schema>.<sequence>
HINT: Use DROP ... CASCADE to drop the dependent objects too.
Is there an elegant one-line solution for that?
import psycopg2
from psycopg2 import sql

cnn = psycopg2.connect('...')
cur = cnn.cursor()
cur.execute("""
    select s.nspname as s, t.relname as t
    from pg_class t join pg_namespace s on s.oid = t.relnamespace
    where t.relkind = 'r'
    and s.nspname !~ '^pg_' and s.nspname != 'information_schema'
    order by 1,2
    """)
tables = cur.fetchall()  # make sure they are the right ones
for t in tables:
    cur.execute(
        sql.SQL("drop table if exists {}.{} cascade")
        .format(sql.Identifier(t[0]), sql.Identifier(t[1])))
cnn.commit()  # goodbye
This is my query using code found perusing this site:
query="""SELECT Family
FROM Table2
INNER JOIN Table1 ON Table1.idSequence=Table2.idSequence
WHERE (Table1.Chromosome, Table1.hg19_coordinate) IN ({seq})
""".format(seq=','.join(['?']*len(matchIds_list)))
matchIds_list is a list of tuples in (?,?) format.
It works if I just ask for one condition (i.e. just Table1.Chromosome as opposed to both Chromosome and hg19_coordinate) and matchIds_list is just a simple list of single values, but I don't know how to get it to work with a composite key, i.e. both columns.
Since you're running SQLite 3.7.17, I'd recommend just using a temporary table.
Create and populate your temporary table.
cursor.executescript("""
CREATE TEMP TABLE control_list (
Chromosome TEXT NOT NULL,
hg19_coordinate TEXT NOT NULL
);
CREATE INDEX control_list_idx ON control_list (Chromosome, hg19_coordinate);
""")
cursor.executemany("""
INSERT INTO control_list (Chromosome, hg19_coordinate)
VALUES (?, ?)
""", matchIds_list)
Just constrain your query to the control list temporary table.
SELECT Family
FROM Table2
INNER JOIN Table1
ON Table1.idSequence = Table2.idSequence
-- Constrain to control_list.
WHERE EXISTS (
SELECT *
FROM control_list
WHERE control_list.Chromosome = Table1.Chromosome
AND control_list.hg19_coordinate = Table1.hg19_coordinate
)
And finally perform your query (there's no need to format this one).
cursor.execute(query)
# Remove the temporary table since we're done with it.
cursor.execute("""
DROP TABLE control_list;
""")
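Put together, the temporary-table steps above can be sketched as one function. This is an illustration under assumptions: the function name families_via_temp_table is made up, the table and column names are the ones from the question, and the coordinates are assumed to be stored as text, as in the control_list definition.

```python
import sqlite3


def families_via_temp_table(con, match_ids):
    """Return Family values whose (Chromosome, hg19_coordinate) is in match_ids."""
    cur = con.cursor()
    cur.execute("""
        CREATE TEMP TABLE control_list (
            Chromosome TEXT NOT NULL,
            hg19_coordinate TEXT NOT NULL
        )""")
    try:
        cur.execute(
            "CREATE INDEX control_list_idx "
            "ON control_list (Chromosome, hg19_coordinate)"
        )
        # One row per composite key; executemany takes the list of tuples as is.
        cur.executemany(
            "INSERT INTO control_list (Chromosome, hg19_coordinate) VALUES (?, ?)",
            match_ids,
        )
        cur.execute("""
            SELECT Family
            FROM Table2
            INNER JOIN Table1 ON Table1.idSequence = Table2.idSequence
            WHERE EXISTS (
                SELECT 1 FROM control_list
                WHERE control_list.Chromosome = Table1.Chromosome
                AND control_list.hg19_coordinate = Table1.hg19_coordinate
            )""")
        return [row[0] for row in cur.fetchall()]
    finally:
        # Remove the temporary table (and its index) whatever happened above.
        cur.execute("DROP TABLE control_list")
```

Because the keys are inserted with executemany, the list of tuples needs no flattening here, and its length is not bounded by the limit on the number of ? parameters in a single statement.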
Short Query (requires SQLite 3.15): You actually almost had it. You need to make the IN ({seq}) a subquery expression.
SELECT Family
FROM Table2
INNER JOIN Table1
ON Table1.idSequence = Table2.idSequence
WHERE (Table1.Chromosome, Table1.hg19_coordinate) IN (VALUES {seq});
Long Query (requires SQLite 3.8.3): It looks a little complicated, but it's pretty straightforward. Put your control list into a sub-select, and then constrain the main select by the control list.
SELECT Family
FROM Table2
INNER JOIN Table1
    ON Table1.idSequence = Table2.idSequence
-- Constrain to control_list.
WHERE EXISTS (
    SELECT *
    FROM (
        SELECT
            -- Name the columns (must match order in tuples).
            "" AS Chromosome,
            ":1" AS hg19_coordinate
        FROM (
            -- Get control list.
            VALUES {seq}
        ) AS control_values
    ) AS control_list
    -- Constrain Table1 to control_list.
    WHERE control_list.Chromosome = Table1.Chromosome
    AND control_list.hg19_coordinate = Table1.hg19_coordinate
)
Regardless of which query you use, when formatting the SQL replace {seq} with (?,?) for each composite key instead of just ?.
query = " ... ".format(seq=','.join(['(?,?)']*len(matchIds_list)))
And finally flatten matchIds_list when you execute the query because it is a list of tuples.
import itertools
cursor.execute(query, list(itertools.chain.from_iterable(matchIds_list)))
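For the short query, the formatting and flattening steps can be combined as in the following sketch. It assumes an SQLite library of at least 3.15 (row values), the table and column names from the question, and a made-up function name, families_via_row_values.

```python
import itertools
import sqlite3


def families_via_row_values(con, match_ids):
    """Look up Family for a list of (Chromosome, hg19_coordinate) tuples."""
    if not match_ids:
        return []  # "VALUES" followed by no rows would be a syntax error
    # One "(?,?)" group per composite key.
    seq = ",".join(["(?,?)"] * len(match_ids))
    query = f"""
        SELECT Family
        FROM Table2
        INNER JOIN Table1 ON Table1.idSequence = Table2.idSequence
        WHERE (Table1.Chromosome, Table1.hg19_coordinate) IN (VALUES {seq})
    """
    # Flatten [(c1, p1), (c2, p2)] into [c1, p1, c2, p2] to match the placeholders.
    params = list(itertools.chain.from_iterable(match_ids))
    return [row[0] for row in con.execute(query, params)]
```

Only the placeholders are formatted into the SQL text; the values themselves are still bound as parameters. For very long lists the limit on the number of ? parameters per statement applies, which is where the temporary table becomes the safer choice.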
This is a follow-up question. Below is a piece of my Python script that reads a constantly growing log file (text) and inserts data into a PostgreSQL DB. A new log file is generated each day. What I do is commit each line, which causes a huge load and really poor performance (it needs 4 hours to insert 30 minutes of file data!). How can I improve this code to insert in bulk instead of line by line? And would this help improve the performance and reduce the load? I've read about copy_from but couldn't figure out how to use it in this situation.
import psycopg2 as psycopg
try:
    connectStr = "dbname='postgis20' user='postgres' password='' host='localhost'"
    cx = psycopg.connect(connectStr)
    cu = cx.cursor()
    logging.info("connected to DB")
except:
    logging.error("could not connect to the database")

import time
file = open('textfile.log', 'r')
while 1:
    where = file.tell()
    line = file.readline()
    if not line:
        time.sleep(1)
        file.seek(where)
    else:
        print line, # already has newline
        dodecode(line)
------------
def dodecode(fields):
    global cx
    from time import strftime, gmtime
    from calendar import timegm
    import os
    msg = fields.split(',')
    part = eval(msg[2])
    msgnum = int(msg[3:6])
    print "message#:", msgnum
    print fields
    if (part==1):
        if msgnum==1:
            msg1 = msg_1.decode(bv)
            #print "message1 :",msg1
            Insert(msgnum,time,msg1)
        elif msgnum==2:
            msg2 = msg_2.decode(bv)
            #print "message2 :",msg2
            Insert(msgnum,time,msg2)
        elif msgnum==3:
            ....
            ....
            ....
----------------
def Insert(msgnum,time,msg):
    global cx
    try:
        if msgnum in [1,2,3]:
            if msg['type']==0:
                cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']), '"+text+"')+" "+str(float(msg['latitude']))+")']))+" WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"';")
                cu.execute("INSERT INTO table2 ( field1,field2,field3, time_stamp, pos,) SELECT "+str(msg['UserID'])+","+str(int(msg['UserName']))+","+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")')," WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
                cu.execute("Update table2 SET field3='"+str(int(msg['UserIO']))+"',time_stamp='"+str(time)+"',pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'),"' WHERE field1='"+str(msg['UserID'])+"' AND time_stamp < '"+str(time)+"';")
            elif msg['type']==1:
                cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']), '"+text+"')+" "+str(float(msg['latitude']))+")']))+" WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"';")
                cu.execute("INSERT INTO table2 ( field1,field2,field3, time_stamp, pos,) SELECT "+str(msg['UserID'])+","+str(int(msg['UserName']))+","+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")')," WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
                cu.execute("Update table2 SET field3='"+str(int(msg['UserIO']))+"',time_stamp='"+str(time)+"',pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'),"' WHERE field1='"+str(msg['UserID'])+"' AND time_stamp < '"+str(time)+"';")
            elif msg['type']==2:
                ....
                ....
                ....
    except Exception, err:
        #print('ERROR: %s\n' % str(err))
        logging.error('ERROR: %s\n' % str(err))
        cx.commit()
    cx.commit()
Doing multiple rows per transaction, and per query, will make it go faster.
When faced with a similar problem I put multiple rows in the values part of the insert query,
but you have complicated insert queries, so you'll likely need a different approach.
I'd suggest creating a temporary table and inserting, say, 10000 rows into it with ordinary multi-row inserts:
insert into temptable values ( /* row1 data */ ) ,( /* row2 data */ ) etc...
500 rows per insert is a good starting point.
Then join the temp table with the existing data to de-dupe it:
delete from temptable using livetable where /* .join condition */ ;
and de-dupe it against itself if that is needed too:
delete from temptable where id not in
( select distinct on ( /* unique columns */) id from temptable);
Then use insert-select to copy the rows from the temporary table into the live table:
insert into livetable ( /* columns */ )
select /* columns */ from temptable;
It looks like you might need an update-from too.
Finally, drop the temp table and start again.
As you're writing to two tables, you're going to need to double up all these operations.
I'd do the insert by maintaining a count and a list of values to insert, and then at insert time
repeating the (%s,%s,%s,%s) part of the query as many times as needed, passing the list of values in separately and letting psycopg2 deal with the formatting.
I'd expect that making those changes could get you a speed-up of 5 times or more.
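The "repeat the (%s,%s,%s,%s) part" idea can be sketched without a database connection, since it only builds the query text and the flat list of values. The names build_multirow_insert and chunks are made up for illustration, and the table and column names are assumed to be trusted identifiers, because they are pasted into the SQL text.

```python
def build_multirow_insert(table, columns, rows):
    """Return (sql, params) for one INSERT that carries every row in rows."""
    if not rows:
        raise ValueError("need at least one row")
    # One "(%s,%s,...)" group per row, one %s per column.
    group = "(" + ",".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ",".join([group] * len(rows))
    )
    # Flatten the rows so the values line up with the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


def chunks(rows, size=500):
    """Yield successive batches of at most size rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

With the cursor and connection from the question, each batch would then be sent with something like cu.execute(*build_multirow_insert("temptable", cols, batch)), followed by a single cx.commit() once all batches are in, rather than one commit per line.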
I have created this table in Python 2.7. I use it to store unique name and value pairs. In some queries I search for names and in others I search for values. Let's say the SELECT queries are split 50-50. Is there any way to create a table that is double-indexed (one index on names and another on values) so my program will look up the data faster?
Here is the database and table creation:
import sqlite3
#-------------------------db creation ---------------------------------------#
db1 = sqlite3.connect('/my_db.db')
cursor = db1.cursor()
cursor.execute("DROP TABLE IF EXISTS my_table")
sql = '''CREATE TABLE my_table (
name TEXT DEFAULT NULL,
value INT
);'''
cursor.execute(sql)
sql = ("CREATE INDEX index_my_table ON my_table (name);")
cursor.execute(sql)
Or is there another structure that gives faster lookups by value?
You can create another index...
sql = ("CREATE INDEX index_my_table2 ON my_table (value);")
cursor.execute(sql)
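To check that both kinds of lookup actually use an index, you can ask SQLite for its query plan. A small sketch, using an in-memory database with the table and index names from above; the helper query_plan is made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (name TEXT DEFAULT NULL, value INT)")
con.execute("CREATE INDEX index_my_table ON my_table (name)")
con.execute("CREATE INDEX index_my_table2 ON my_table (value)")


def query_plan(sql, params=()):
    """Return the human-readable plan lines SQLite reports for a query."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


print(query_plan("SELECT * FROM my_table WHERE name = ?", ("x",)))
print(query_plan("SELECT * FROM my_table WHERE value = ?", (1,)))
```

Each plan line should mention the index being searched; a plan that says SCAN instead of SEARCH means the whole table is being read.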
I think the best way to get faster lookups is to create an index on the 2 fields,
like: sql = ("CREATE INDEX index_my_table ON my_table (Field1, field2)")
See Multi-Column Indices and Covering Indices
in the (great) doc here: https://www.sqlite.org/queryplanner.html