This is a follow-up question. Below is a piece of my Python script that reads a constantly growing log file (text) and inserts data into a PostgreSQL DB. A new log file is generated each day. What I do is commit each line, which causes a huge load and really poor performance (it needs 4 hours to insert 30 minutes of the file's data!). How can I improve this code to insert in bulk instead of line by line? And would this help improve the performance and reduce the load? I've read about copy_from but couldn't figure out how to use it in such a situation.
import logging
import time
import psycopg2 as psycopg

try:
    connectStr = "dbname='postgis20' user='postgres' password='' host='localhost'"
    cx = psycopg.connect(connectStr)
    cu = cx.cursor()
    logging.info("connected to DB")
except:
    logging.error("could not connect to the database")

file = open('textfile.log', 'r')
while 1:
    where = file.tell()
    line = file.readline()
    if not line:
        time.sleep(1)
        file.seek(where)
    else:
        print line,  # already has newline
        dodecode(line)
------------
def dodecode(fields):
    global cx
    from time import strftime, gmtime
    from calendar import timegm
    import os
    msg = fields.split(',')
    part = eval(msg[2])
    msgnum = int(msg[3:6])
    print "message#:", msgnum
    print fields
    if (part==1):
        if msgnum==1:
            msg1 = msg_1.decode(bv)
            #print "message1 :",msg1
            Insert(msgnum,time,msg1)
        elif msgnum==2:
            msg2 = msg_2.decode(bv)
            #print "message2 :",msg2
            Insert(msgnum,time,msg2)
        elif msgnum==3:
            ....
            ....
            ....
----------------
def Insert(msgnum,time,msg):
    global cx
    try:
        if msgnum in [1,2,3]:
            if msg['type']==0:
                cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'), '"+text+"' WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"');")
                cu.execute("INSERT INTO table2 ( field1, field2, field3, time_stamp, pos ) SELECT "+str(msg['UserID'])+","+str(int(msg['UserName']))+","+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")') WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
                cu.execute("UPDATE table2 SET field3="+str(int(msg['UserIO']))+", time_stamp='"+str(time)+"', pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")') WHERE field1="+str(msg['UserID'])+" AND time_stamp < '"+str(time)+"';")
            elif msg['type']==1:
                cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'), '"+text+"' WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"');")
                cu.execute("INSERT INTO table2 ( field1, field2, field3, time_stamp, pos ) SELECT "+str(msg['UserID'])+","+str(int(msg['UserName']))+","+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")') WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
                cu.execute("UPDATE table2 SET field3="+str(int(msg['UserIO']))+", time_stamp='"+str(time)+"', pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")') WHERE field1="+str(msg['UserID'])+" AND time_stamp < '"+str(time)+"';")
            elif msg['type']==2:
                ....
                ....
                ....
    except Exception, err:
        #print('ERROR: %s\n' % str(err))
        logging.error('ERROR: %s\n' % str(err))
    cx.commit()  # commits after every single line – this is the bottleneck
Doing multiple rows per transaction, and per query, will make it go faster.
When faced with a similar problem I put multiple rows in the VALUES part of the insert query,
but you have complicated insert queries, so you'll likely need a different approach.
I'd suggest creating a temporary table and inserting, say, 10000 rows into it with ordinary multi-row inserts:
insert into temptable values ( /* row1 data */ ), ( /* row2 data */ ) etc...
500 rows per insert is a good starting point.
then joining the temp table with the existing data to de-dupe it:
delete from temptable using livetable where /* join condition */ ;
and de-duping it against itself, if that is needed too:
delete from temptable where id not in
( select distinct on ( /* unique columns */ ) id from temptable );
then using insert-select to copy the rows from the temporary table into the live table
insert into livetable ( /* columns */ )
select /* columns */ from temptable;
it looks like you might need an update-from too,
and finally dropping the temp table and starting again.
As you're writing to two tables, you're going to need to double up all these operations.
I'd do the insert by maintaining a count and a list of values to insert, and then at insert time
building the query by repeating the (%s,%s,%s,%s) part as many times as needed, passing the list of values in separately and letting psycopg2 deal with the formatting.
I'd expect making those changes could get you a speed-up of 5 times or more.
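The batching idea in the last paragraph can be sketched as follows. This is a minimal sketch, not the asker's actual schema: the table and column names are made up, and `cu` is assumed to be an open psycopg2 cursor.

```python
# Sketch of accumulating rows and sending one multi-row INSERT per flush.
# Table/column names are illustrative; `cu` is assumed to be a psycopg2 cursor.

def build_multirow_insert(table, columns, row_count):
    """Repeat the (%s, %s, ...) group once per row; psycopg2 fills the values."""
    group = "(" + ", ".join(["%s"] * len(columns)) + ")"
    return "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([group] * row_count))

def flush(cu, batch):
    """Send the whole batch in one statement, then clear it."""
    if not batch:
        return
    sql = build_multirow_insert("temptable",
                                ["messageid", "time_stamp", "userid"],
                                len(batch))
    cu.execute(sql, [value for row in batch for value in row])  # flat parameter list
    del batch[:]
```

Call `flush(cu, batch)` every ~500 rows and once at end of file, then run the de-dupe and insert-select steps against the temp table as described above.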
Related
Consider some SQLite database.db, with a large number of tables and columns.
Pandas' .describe() produces the summary statistics that I want (see below). However, it requires reading each table in full, which is a problem for large databases. Is there an (SQL or Python) alternative that is less memory-hungry? Specifying column names manually is not feasible here.
import pandas as pd
import sqlite3

con = sqlite3.connect("file:database.db", uri=True)
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", con)

columns = []
for _, row in tables.iterrows():
    col = pd.read_sql(f"PRAGMA table_info({row['name']})", con)
    col['table'] = row['name']
    stats = pd.read_sql(f"""SELECT * FROM {row['name']}""", con)
    stats = stats.describe(include='all')
    stats = stats.transpose()
    col = col.merge(stats, left_on='name', right_index=True)
    columns.append(col)
columns = pd.concat(columns)
Perhaps a little over the top, but you could use TRIGGERS to maintain statistics and eliminate the need for full table scans. There will obviously be some overhead for maintaining the statistics, but that overhead is distributed over time.
Consider the following demo (in SQL), where there are two main tables, tablea and tablex (there could be any number of tables), plus another table called statistic which is used to dynamically store the statistics.
For each main table 3 triggers are created: one for when a row is inserted, one for when a row is updated, and one for when a row is deleted. So 6 triggers in all for the 2 main tables.
The statistic table has 5 columns:
tablename, which is the primary key and holds the name of the table the row stores statistics about
row_count, for the number of rows (in theory)
The insert trigger for the respective table increments the row_count
The delete trigger decrements the row_count
insert_count
The insert trigger increments the insert_count
update_count
The update trigger increments the update_count
delete_count
The delete trigger increments the delete_count
All of the triggers first try to insert the respective row for the table, with all values using the default of 0. As tablename is the primary key, the INSERT OR IGNORE ensures that the row is only added once (unless the row is deleted, which effectively resets the stats for the table).
The demo includes some insertions, deletions and updates, and finally extraction of the statistics:-
DROP TABLE IF EXISTS tablea;
DROP TABLE IF EXISTS tablex;
DROP TABLE IF EXISTS statistic;
CREATE TABLE IF NOT EXISTS statistic (
tablename TEXT PRIMARY KEY,
row_count INTEGER DEFAULT 0,
insert_count INTEGER DEFAULT 0,
update_count INTEGER DEFAULT 0,
delete_count INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS tablea (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER IF NOT EXISTS tablea_after_ins AFTER INSERT ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET row_count = row_count +1, insert_count = insert_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TRIGGER IF NOT EXISTS tablea_after_update AFTER UPDATE ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TRIGGER IF NOT EXISTS tablea_after_delete AFTER DELETE ON tablea
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablea');
UPDATE statistic SET row_count = row_count -1, delete_count = delete_count + 1 WHERE tablename = 'tablea';
END
;
CREATE TABLE IF NOT EXISTS tablex (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER IF NOT EXISTS tablex_after_ins AFTER INSERT ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET row_count = row_count +1, insert_count = insert_count + 1 WHERE tablename = 'tablex';
END
;
CREATE TRIGGER IF NOT EXISTS tablex_after_update AFTER UPDATE ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablex';
END
;
CREATE TRIGGER IF NOT EXISTS tablex_after_delete AFTER DELETE ON tablex
BEGIN
INSERT OR IGNORE INTO statistic (tablename) VALUES('tablex');
UPDATE statistic SET row_count = row_count -1, delete_count = delete_count + 1 WHERE tablename = 'tablex';
END
;
INSERT INTO tablea (data1) VALUES('a');
INSERT INTO tablea (data1) VALUES('b'),('c'),('d'),('z');
DELETE FROM tablea WHERE data1 LIKE 'z';
UPDATE tablea set data1 = 'letter_'||data1;
DELETE FROM tablea WHERE data1 LIKE '%_c';
INSERT OR IGNORE INTO tablex (data1) VALUES
('1a'),('2a'),('3a'),('4a'),('5a')
,('1b'),('2b'),('3b'),('4b'),('5b')
,('1c'),('2c'),('3c'),('4c'),('5c')
,('1d'),('2d'),('3d'),('4d'),('5d')
;
SELECT * FROM statistic;
/* Cleanup the demo environment */
DROP TABLE IF EXISTS tablea;
DROP TABLE IF EXISTS tablex;
DROP TABLE IF EXISTS statistic;
When run, the result (not reproduced here) shows one statistic row per table.
Note that the mass insert into tablex records all 20 rows added (i.e. the trigger fires once for every inserted row, and the triggering is part of the transaction).
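Since the result screenshot has not survived, the statistics for tablea can be reproduced with a few lines of Python using just the tablea portion of the demo above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE statistic (
  tablename TEXT PRIMARY KEY,
  row_count INTEGER DEFAULT 0,
  insert_count INTEGER DEFAULT 0,
  update_count INTEGER DEFAULT 0,
  delete_count INTEGER DEFAULT 0
);
CREATE TABLE tablea (id INTEGER PRIMARY KEY, data1 TEXT);
CREATE TRIGGER tablea_after_ins AFTER INSERT ON tablea BEGIN
  INSERT OR IGNORE INTO statistic (tablename) VALUES ('tablea');
  UPDATE statistic SET row_count = row_count + 1, insert_count = insert_count + 1
  WHERE tablename = 'tablea';
END;
CREATE TRIGGER tablea_after_update AFTER UPDATE ON tablea BEGIN
  INSERT OR IGNORE INTO statistic (tablename) VALUES ('tablea');
  UPDATE statistic SET update_count = update_count + 1 WHERE tablename = 'tablea';
END;
CREATE TRIGGER tablea_after_delete AFTER DELETE ON tablea BEGIN
  INSERT OR IGNORE INTO statistic (tablename) VALUES ('tablea');
  UPDATE statistic SET row_count = row_count - 1, delete_count = delete_count + 1
  WHERE tablename = 'tablea';
END;
INSERT INTO tablea (data1) VALUES ('a');
INSERT INTO tablea (data1) VALUES ('b'),('c'),('d'),('z');
DELETE FROM tablea WHERE data1 LIKE 'z';
UPDATE tablea SET data1 = 'letter_'||data1;
DELETE FROM tablea WHERE data1 LIKE '%_c';
""")

# 5 inserts, 4 updates, 2 deletes -> 3 rows remaining
print(con.execute("SELECT * FROM statistic").fetchall())
# [('tablea', 3, 5, 4, 2)]
```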
I have 2 database connections. I want to compare a single table from each connection to each other. And if there are unmatched records, I want to add them to the table database where they are missing.
This is what I came up with, but it doesn't seem to do the inserting part. I'm new to Python, so excuse the code. Thanks.
# establishing connections and querying the database
import sqlite3

con1 = sqlite3.connect("database1.db")
cur1 = con1.cursor()
table1 = cur1.execute("SELECT * FROM table1")
mylist = table1.fetchall()

con2 = sqlite3.connect("database2.db")
cur2 = con2.cursor()
table2 = cur2.execute("SELECT * FROM table2")
mylist2 = table2.fetchall()

# finding unmatched elements and inserting them into the database
def non_match_elements(mylist2, mylist):
    non_match = []
    for i in mylist2:
        if i not in mylist:
            non_match.append(i)
    return non_match  # this return was missing, so executemany received None

non_match = non_match_elements(mylist2, mylist)
cur1.executemany("""INSERT INTO table1 VALUES (?,?,?)""", non_match)
con1.commit()

res = cur1.execute("select column from table1")
print(res.fetchall())
Thanks again guys
I would suggest ATTACHing one connection to the other; you then have two INSERT INTO table SELECT * FROM table2 WHERE ... queries that would insert from one table into the other.
Here's an example/demo (not of the ATTACH DATABASE itself, but of aligning two tables with the same schema but different data):-
/* Cleanup - just in case*/
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
/* Create the two tables */
CREATE TABLE IF NOT EXISTS table1 (val1 TEXT, val2 TEXT, val3 TEXT);
CREATE TABLE IF NOT EXISTS table2 (val1 TEXT, val2 TEXT, val3 TEXT);
/*****************************************************************/
/* load the two different sets of data and also some common data */
INSERT INTO table1 VALUES ('A','AA','AAA'),('B','BB','BBB'),('C','CC','CCC'),('M','MM','MMM');
INSERT INTO table2 VALUES ('X','XX','XXX'),('Y','YY','YYY'),('Z','ZZ','ZZZ'),('M','MM','MMM');
/*************************************************************/
/* Match each table to the other using an INSERT .... SELECT */
/*************************************************************/
INSERT INTO table1 SELECT * FROM table2 WHERE val1||val2||val3 NOT IN (SELECT(val1||val2||val3) FROM table1);
INSERT INTO table2 SELECT * FROM table1 WHERE val1||val2||val3 NOT IN (SELECT(val1||val2||val3) FROM table2);
/* Output both tables */
SELECT 'T1',* FROM table1;
SELECT 'T2',* FROM table2;
/* Cleanup */
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
The results of the 2 SELECTs (the first column, T1 or T2, just being used to indicate which table the SELECT is from) show that:
table1 has the X, Y and Z value rows copied from table2;
table2 has the A, B and C value rows copied from table1;
the M value row, as it exists in both, remains intact, neither duplicated nor deleted.
Thus, data-wise, the two tables are identical.
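The ATTACH approach suggested at the start can be sketched in Python like this. It is a minimal sketch: in-memory databases stand in for database1.db and database2.db, and the sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")            # stands in for database1.db
con.execute("CREATE TABLE table1 (val1 TEXT, val2 TEXT, val3 TEXT)")
con.executemany("INSERT INTO table1 VALUES (?,?,?)",
                [("A", "AA", "AAA"), ("M", "MM", "MMM")])

# attach a second database (stands in for database2.db)
con.execute("ATTACH DATABASE ':memory:' AS db2")
con.execute("CREATE TABLE db2.table2 (val1 TEXT, val2 TEXT, val3 TEXT)")
con.executemany("INSERT INTO db2.table2 VALUES (?,?,?)",
                [("X", "XX", "XXX"), ("M", "MM", "MMM")])

# copy over only the rows that table1 is missing
con.execute("""INSERT INTO table1
               SELECT * FROM db2.table2
               WHERE val1||val2||val3 NOT IN (SELECT val1||val2||val3 FROM table1)""")

rows = con.execute("SELECT val1 FROM table1 ORDER BY val1").fetchall()
print(rows)  # [('A',), ('M',), ('X',)]
```

With real files you would attach database2.db by path and run the mirror-image INSERT on the other connection as well.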
I am performing an ETL task where I am querying tables in a Data Warehouse to see if it contains IDs in a DataFrame (df) which was created by joining tables from the operational database.
The DataFrame only has ID columns from each joined table in the operational database. I have created a variable for each of these columns, e.g. 'billing_profiles_dim_id' as below:
billing_profiles_dim_id = df['billing_profiles_dim_id']
I am attempting to iterate row by row to see if the ID here is in the 'billing_profiles_dim' table of the Data Warehouse. Where the ID is not present, I want to populate the DWH tables row by row using the matching ID rows in the ODB:
for key in billing_profiles_dim_id:
    sql = "SELECT * FROM billing_profiles_dim WHERE id = '"+str(key)+"'"
    dwh_cursor.execute(sql)
    result = dwh_cursor.fetchone()
    if result == None:
        sqlQuery = "SELECT * from billing_profile where id = '"+str(key)+"'"
        sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name"')
        op_cursor = op_connector.execute(sqlInsert)
        billing_profile = op_cursor.fetchone()
So far at least, I am receiving the following error:
SyntaxError: EOL while scanning string literal
This error message points at the closing bracket in
sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name"')
I am currently unable to solve this. I'm also aware that this code may run into another problem or two, so could someone please help me solve the current issue and make sure I'm heading down the correct path?
You are missing a double tick and a +
sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name+"')"
But you should really switch to prepared statements, like:
sql = "SELECT * FROM billing_profiles_dim WHERE id = %s"
dwh_cursor.execute(sql, (str(key),))
...
sqlInsert = ('INSERT INTO billing_profile_dim VALUES '
             '(%s, %s)')
dwh_cursor.execute(sqlInsert, (str(key), billing_profile.name))
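A minimal runnable illustration of the parameterized style, using the stdlib sqlite3 module (which uses ? placeholders instead of the %s style above; the table and sample values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE billing_profile_dim (id TEXT, name TEXT)")

key, name = "42", "Acme"   # made-up sample values
# the driver handles all quoting/escaping, so no broken string literals
con.execute("INSERT INTO billing_profile_dim VALUES (?, ?)", (key, name))
row = con.execute("SELECT * FROM billing_profile_dim WHERE id = ?", (key,)).fetchone()
print(row)  # ('42', 'Acme')
```

Besides eliminating the quoting errors above, this also protects against SQL injection from unexpected values.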
Hope everyone's doing well.
Database:
Value    Date
-----    ----------
3000     2019-12-15
6000     2019-12-17
What I hope to return:
"Data:3000 on 2019-12-15"
"NO data on 2019-12-16" (non-existing column based on Date)
"Data:6000 on 2019-12-17"
I don't know how to filter non-existing records(rows) based on a column.
Possible boilerplate code:
import sqlite3

db = sqlite3.connect("Database1.db")
cursor = db.cursor()
cursor.execute("""
    SELECT * FROM Table1
    WHERE Date >= "2019-12-15" and Date <= "2019-12-17"
""")
entry = cursor.fetchall()
for i in entry:
    if i is None:
        print("No entry found:", i)
    else:
        print("Entry found")
db.close()
Any help is much appreciated!
The general way you might handle this problem uses something called a calendar table, which is just a table containing all dates you want to see in your report. Consider the following query:
SELECT
d.dt,
t.Value
FROM
(
SELECT '2019-12-15' AS dt UNION ALL
SELECT '2019-12-16' UNION ALL
SELECT '2019-12-17'
) d
LEFT JOIN yourTable t
ON d.dt = t.Date
ORDER BY
d.dt;
In practice, if you had a long term need to do this and/or had a large number of dates to cover, you might setup a bona-fide calendar table in your SQLite database for this purpose. The above query is only intended to be a proof-of-concept.
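A quick sqlite3 sketch of the calendar-table query against the sample data (table and column names follow the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (Value INTEGER, Date TEXT)")
con.executemany("INSERT INTO Table1 VALUES (?,?)",
                [(3000, "2019-12-15"), (6000, "2019-12-17")])

# dates missing from Table1 come back with Value = NULL
rows = con.execute("""
    SELECT d.dt, t.Value
    FROM (SELECT '2019-12-15' AS dt UNION ALL
          SELECT '2019-12-16' UNION ALL
          SELECT '2019-12-17') d
    LEFT JOIN Table1 t ON d.dt = t.Date
    ORDER BY d.dt
""").fetchall()

for dt, value in rows:
    if value is None:
        print("NO data on {}".format(dt))
    else:
        print("Data:{} on {}".format(value, dt))
```

This prints exactly the three lines the question asks for, with the 2019-12-16 gap reported as "NO data".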
I have a table with ~133M rows and 16 columns. I want to create 14 tables on another database on the same server for each of columns 3-16 (columns 1 and 2 are `id` and `timestamp` which will be in the final 14 tables as well but won't have their own table), where each table will have the name of the original column. Is this possible to do exclusively with an SQL script? It seems logical to me that this would be the preferred, and fastest way to do it.
Currently, I have a Python script that "works" by parsing the CSV dump of the original table (testing with 50 rows), creating new tables, and adding the associated values, but it is very slow (I estimated almost 1 year to transfer all 133M rows, which is obviously not acceptable). This is my first time using SQL in any capacity, and I'm certain that my code can be sped up, but I'm not sure how because of my unfamiliarity with SQL. The big SQL string command in the middle was copied from some other code in our codebase. I've tried using transactions as seen below, but it didn't seem to have any significant effect on the speed.
import re
import time

import mysql.connector

# option flags
debug = False   # prints out information during runtime
timing = True   # times the execution time of the program

# save start time for timing. won't be used later if timing is false
start_time = time.time()

# open file for reading
path = 'test_vaisala_sql.csv'
file = open(path, 'r')

# read in column values
column_str = file.readline().strip()
columns = re.split(',vaisala_|,', column_str)  # parse columns with regex to remove commas and vaisala_
if debug:
    print(columns)

# open connection to MySQL server
cnx = mysql.connector.connect(user='root', password='<redacted>',
                              host='127.0.0.1',
                              database='measurements')
cursor = cnx.cursor()

# create the tables in the MySQL database if they don't already exist
for i in range(2, len(columns)):
    table_name = 'vaisala2_' + columns[i]
    sql_command = "CREATE TABLE IF NOT EXISTS " + \
                  table_name + "(`id` BIGINT(20) NOT NULL AUTO_INCREMENT, " \
                  "`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, " \
                  "`milliseconds` BIGINT(20) NOT NULL DEFAULT '0', " \
                  "`value` varchar(255) DEFAULT NULL, " \
                  "PRIMARY KEY (`id`), " \
                  "UNIQUE KEY `milliseconds` (`milliseconds`) " \
                  "COMMENT 'Eliminates duplicate millisecond values', " \
                  "KEY `timestamp` (`timestamp`)) " \
                  "ENGINE=InnoDB DEFAULT CHARSET=utf8;"
    if debug:
        print("Creating table", table_name, "in database")
    cursor.execute(sql_command)

# read in rest of lines in CSV file
for line in file.readlines():
    cursor.execute("START TRANSACTION;")
    line = line.strip()
    values = re.split(',"|",|,', line)  # regex split along commas, or commas and quotes
    if debug:
        print(values)
    # iterate over each data column. Starts at 2 to eliminate `id` and `timestamp`
    for i in range(2, len(columns)):
        table_name = "vaisala2_" + columns[i]
        timestamp = values[1]
        # translate timestamp back to epoch time
        try:
            pattern = '%Y-%m-%d %H:%M:%S'
            epoch = int(time.mktime(time.strptime(timestamp, pattern)))
            milliseconds = epoch * 1000  # convert seconds to ms
        except ValueError:  # errors default to 0
            milliseconds = 0
        value = values[i]
        # generate SQL command to insert data into destination table
        sql_command = "INSERT IGNORE INTO {} VALUES (NULL,'{}',{},'{}');".format(table_name, timestamp,
                                                                                 milliseconds, value)
        if debug:
            print(sql_command)
        cursor.execute(sql_command)
    cnx.commit()  # commits changes in destination MySQL server

# print total execution time
if timing:
    print("Completed in %s seconds" % (time.time() - start_time))
This doesn't need to be incredibly optimized; it's perfectly acceptable if the machine has to run for a few days in order to do it. But 1 year is far too long.
You can create a table from a SELECT like:
CREATE TABLE <other database name>.<column name>
AS
SELECT <column name>
FROM <original database name>.<table name>;
(Replace the <...> with your actual object names or extend it with other columns or a WHERE clause or ...)
That will also insert the data from the query into the new table. And it's probably the fastest way.
You could use dynamic SQL and information from the catalog (namely information_schema.columns) to create the CREATE statements or create them manually, which is annoying but acceptable for 14 columns I guess.
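A sketch of generating those statements in Python. The database, table, and column names here are assumptions standing in for the real ones:

```python
# Hypothetical sketch: generate one CREATE TABLE ... AS SELECT per data column.
# All object names (split_db, measurements, vaisala) are placeholders.

def make_statements(columns, src_db="measurements", src_table="vaisala",
                    dst_db="split_db"):
    """Skip the first two columns (id, timestamp); emit one statement per remaining column."""
    stmts = []
    for col in columns[2:]:
        stmts.append(
            "CREATE TABLE {dst}.{col} AS "
            "SELECT id, timestamp, {col} FROM {src_db}.{src_table};".format(
                dst=dst_db, col=col, src_db=src_db, src_table=src_table))
    return stmts

# The column list itself could come from the catalog, e.g.:
#   SELECT column_name FROM information_schema.columns
#   WHERE table_schema = 'measurements' AND table_name = 'vaisala'
#   ORDER BY ordinal_position;
```

Running the generated statements server-side avoids pulling the 133M rows through Python at all.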
When using scripts to talk to databases you want to minimise the number of messages sent, as each message adds a further delay to your execution time. Currently it looks as if you are sending (by your approximation) 133 million messages, and thus slowing down your script 133 million times. A simple optimisation would be to parse your spreadsheet, split the data into the per-column tables (either in memory or saved to disk), and only then send the data to the new DB.
As you hinted, it's much quicker to write an SQL script to redistribute the data.
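The "split first, send later" idea can be sketched with the stdlib sqlite3 module standing in for MySQL (the table name and rows are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vaisala2_temperature (ms INTEGER, value TEXT)")

# rows accumulated in memory while parsing the CSV
rows = [(1000, "20.1"), (2000, "20.3"), (3000, "20.2")]

# one round-trip per batch instead of one per row
con.executemany("INSERT INTO vaisala2_temperature VALUES (?, ?)", rows)
con.commit()  # one commit for the whole batch, not one per row

count = con.execute("SELECT COUNT(*) FROM vaisala2_temperature").fetchone()[0]
print(count)  # 3
```

With mysql.connector the same pattern applies: accumulate a list per destination table, then call `cursor.executemany` (with %s placeholders) and commit once per batch.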