I have a db with a million rows. I want to fetch all the rows, do some operation on them, and insert them into another table (newTable).
I figured out I need to use a server-side cursor, since I cannot fetch all the data into memory.
I also figured out I need to use two connections, so that when I commit I don't lose the cursor I made.
But now my problem is: it won't put all the records into newTable that the log says it inserted.
In the console log I see it is past the 500,000th insert:
560530 inserting 20551581 and 2176511
But when I do a count on the new table (while the program is still running) it shows only about 10,000 rows:
select count(*) from newTable;
count
-------
10236
And when the program finishes, I only have about 11,000 records in the new table, while the log shows it tried to insert at least 2 million rows. What's wrong with my code?
def fillMyTable(self):
    try:
        self.con = psycopg2.connect(database='XXXX', user='XXXX', password='XXXX', host='localhost')
        cur = self.con.cursor(name="mycursor")
        cur.arraysize = 1000
        cur.itersize = 2000
        self.con2 = psycopg2.connect(database='XXXX', user='XXXX', password='XXXX', host='localhost')
        cur2 = self.con2.cursor()
        q = "SELECT id,oldgroups from oldTable;"
        cur.execute(q)
        i = 0
        while True:
            batch = cur.fetchmany()
            if not batch:
                break
            for row in batch:
                userid = row[0]
                groupids = self.doSomethingOnGroups(row[1])
                for groupid in groupids:
                    # insert only if it does NOT exist
                    i += 1
                    print (str(i)+" inserting "+str(userid)+" and "+str(groupid))
                    q2 = "INSERT INTO newTable (userid, groupid) SELECT %s, %s WHERE NOT EXISTS ( SELECT %s FROM newTable WHERE groupid = %s);" % (userid, groupid, userid, groupid)
                    cur2.execute(q2)
                    self.con2.commit()
    except psycopg2.DatabaseError, e:
        self.writeLog(e)
    finally:
        cur.close()
        self.con2.commit()
        self.con.close()
        self.con2.close()
Update: I also noticed it uses a lot of my RAM. Isn't a server-side cursor supposed to avoid that?
Cpu(s): 15.2%us, 6.4%sy, 0.0%ni, 56.5%id, 2.8%wa, 0.0%hi, 0.2%si, 18.9%st
Mem:  1695220k total, 1680496k used, 14724k free, 3084k buffers
Swap: 0k total, 0k used, 0k free, 1395020k cached
If the oldgroups column is in the form 1,3,6,7 this will work:
insert into newTable (userid, groupid)
select id, groupid
from (
select
id,
regexp_split_to_table(oldgroups, ',') as groupid
from oldTable
) o
where
not exists (
select 1
from newTable
where groupid = o.groupid
)
and groupid < 10000000
But I suspect you want to check for the existence of both groupid and id:
insert into newTable (userid, groupid)
select id, groupid
from (
select
id,
regexp_split_to_table(oldgroups, ',') as groupid
from oldTable
) o
where
not exists (
select 1
from newTable
where groupid = o.groupid and userid = o.id
)
and groupid < 10000000
The regexp_split_to_table function "explodes" the oldgroups column into rows, effectively cross-joining each id with its group ids.
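If you still want to drive this from Python, the whole fetchmany/insert loop collapses into a single execute of that statement. A minimal sketch, assuming the connection settings and the userid/groupid columns from the question:

import psycopg2

con = psycopg2.connect(database='XXXX', user='XXXX',
                       password='XXXX', host='localhost')
try:
    cur = con.cursor()  # a plain cursor: nothing is fetched client-side
    cur.execute("""
        insert into newTable (userid, groupid)
        select id, groupid
        from (
            select id,
                   regexp_split_to_table(oldgroups, ',') as groupid
            from oldTable
        ) o
        where not exists (
            select 1
            from newTable
            where groupid = o.groupid and userid = o.id
        )""")
    con.commit()  # one commit for the whole insert
finally:
    con.close()

A single commit at the end also avoids the per-row commit in the original loop.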
Related
OK, so I'm trying to improve my ASP data entry page to ensure that each entry going into my data table is unique.
In this table I have SoftwareName and SoftwareType. I'm trying to get it so that if the entry page sends an insert query whose parameters match what's already in the table (same title and type), an error is thrown and the data isn't entered.
Something like this:
INSERT INTO tblSoftwareTitles(
    SoftwareName,
    SoftwareSystemType)
VALUES(@SoftwareName,@SoftwareType)
WHERE NOT EXISTS (SELECT SoftwareName
                  FROM tblSoftwareTitles
                  WHERE SoftwareName = @SoftwareName
                  AND SoftwareType = @SoftwareType)
This syntax works great for selecting columns from one table into another without duplicates being entered, but it doesn't seem to work with a parameterized insert query. Can anyone help me out with this?
Edit:
Here's the code I'm using in my ASP insert method
private void ExecuteInsert(string name, string type)
{
    //Creates a new connection using the HWM string
    using (SqlConnection HWM = new SqlConnection(GetConnectionStringHWM()))
    {
        //Creates a sql string with parameters
        string sql = " INSERT INTO tblSoftwareTitles( "
                   + "     SoftwareName, "
                   + "     SoftwareSystemType) "
                   + " SELECT "
                   + "     @SoftwareName, "
                   + "     @SoftwareType "
                   + " WHERE NOT EXISTS "
                   + " ( SELECT 1 "
                   + "   FROM tblSoftwareTitles "
                   + "   WHERE SoftwareName = @SoftwareName "
                   + "   AND SoftwareSystemType = @SoftwareType); ";
        //Opens the connection
        HWM.Open();
        try
        {
            //Creates a Sql command
            using (SqlCommand addSoftware = new SqlCommand
            {
                CommandType = CommandType.Text,
                Connection = HWM,
                CommandTimeout = 300,
                CommandText = sql
            })
            {
                //Adds parameters to the Sql command
                addSoftware.Parameters.Add("@SoftwareName", SqlDbType.NVarChar, 200).Value = name;
                addSoftware.Parameters.Add("@SoftwareType", SqlDbType.Int).Value = type;
                //Executes the Sql
                addSoftware.ExecuteNonQuery();
            }
            Alert.Show("Software title saved!");
        }
        catch (System.Data.SqlClient.SqlException ex)
        {
            string msg = "Insert Error:";
            msg += ex.Message;
            throw new Exception(msg);
        }
    }
}
You could do this using an IF statement:
IF NOT EXISTS
    ( SELECT 1
      FROM tblSoftwareTitles
      WHERE SoftwareName = @SoftwareName
      AND SoftwareSystemType = @SoftwareType
    )
BEGIN
    INSERT tblSoftwareTitles (SoftwareName, SoftwareSystemType)
    VALUES (@SoftwareName, @SoftwareType)
END;
Or you could do it without IF, using SELECT:
INSERT tblSoftwareTitles (SoftwareName, SoftwareSystemType)
SELECT @SoftwareName, @SoftwareType
WHERE NOT EXISTS
    ( SELECT 1
      FROM tblSoftwareTitles
      WHERE SoftwareName = @SoftwareName
      AND SoftwareSystemType = @SoftwareType
    );
Both methods are susceptible to a race condition, so while I would still use one of the above to insert, you should also safeguard against duplicate inserts with a unique constraint:
CREATE UNIQUE NONCLUSTERED INDEX UQ_tblSoftwareTitles_Softwarename_SoftwareSystemType
ON tblSoftwareTitles (SoftwareName, SoftwareSystemType);
Example on SQL-Fiddle
ADDENDUM
In SQL Server 2008 or later you can use MERGE with HOLDLOCK to remove the chance of a race condition (which is still not a substitute for a unique constraint).
MERGE tblSoftwareTitles WITH (HOLDLOCK) AS t
USING (VALUES (@SoftwareName, @SoftwareType)) AS s (SoftwareName, SoftwareSystemType)
ON s.SoftwareName = t.SoftwareName
AND s.SoftwareSystemType = t.SoftwareSystemType
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SoftwareName, SoftwareSystemType)
    VALUES (s.SoftwareName, s.SoftwareSystemType);
Example of Merge on SQL Fiddle
This isn't an answer. I just want to show that the IF NOT EXISTS(...) INSERT method isn't safe. You have to execute Session #1 first and then Session #2. After Session #2 you will see that without a UNIQUE index you can get duplicate (SoftwareName, SoftwareSystemType) pairs. The delay in Session #1 gives you enough time to execute the second script (Session #2); you can reduce this delay.
Session #1 (SSMS > New Query > F5 (Execute))
CREATE DATABASE DemoEXISTS;
GO
USE DemoEXISTS;
GO
CREATE TABLE dbo.Software(
SoftwareID INT PRIMARY KEY,
SoftwareName NCHAR(400) NOT NULL,
SoftwareSystemType NVARCHAR(50) NOT NULL
);
GO
INSERT INTO dbo.Software(SoftwareID,SoftwareName,SoftwareSystemType)
VALUES (1,'Dynamics AX 2009','ERP');
INSERT INTO dbo.Software(SoftwareID,SoftwareName,SoftwareSystemType)
VALUES (2,'Dynamics NAV 2009','SCM');
INSERT INTO dbo.Software(SoftwareID,SoftwareName,SoftwareSystemType)
VALUES (3,'Dynamics CRM 2011','CRM');
INSERT INTO dbo.Software(SoftwareID,SoftwareName,SoftwareSystemType)
VALUES (4,'Dynamics CRM 2013','CRM');
INSERT INTO dbo.Software(SoftwareID,SoftwareName,SoftwareSystemType)
VALUES (5,'Dynamics CRM 2015','CRM');
GO
/*
CREATE UNIQUE INDEX IUN_Software_SoftwareName_SoftwareSystemType
ON dbo.Software(SoftwareName,SoftwareSystemType);
GO
*/
-- Session #1
BEGIN TRANSACTION;
UPDATE dbo.Software
SET SoftwareName='Dynamics CRM',
SoftwareSystemType='CRM'
WHERE SoftwareID=5;
WAITFOR DELAY '00:00:15' -- 15 seconds delay; you have less than 15 seconds to switch SSMS window to session #2
UPDATE dbo.Software
SET SoftwareName='Dynamics AX',
SoftwareSystemType='ERP'
WHERE SoftwareID=1;
COMMIT
--ROLLBACK
PRINT 'Session #1 results:';
SELECT *
FROM dbo.Software;
Session #2 (SSMS > New Query > F5 (Execute))
USE DemoEXISTS;
GO
-- Session #2
DECLARE
    @SoftwareName NVARCHAR(100),
    @SoftwareSystemType NVARCHAR(50);
SELECT
    @SoftwareName=N'Dynamics AX',
    @SoftwareSystemType=N'ERP';
PRINT 'Session #2 results:';
IF NOT EXISTS(SELECT *
              FROM dbo.Software s
              WHERE s.SoftwareName=@SoftwareName
              AND s.SoftwareSystemType=@SoftwareSystemType)
BEGIN
    PRINT 'Session #2: INSERT';
    INSERT INTO dbo.Software(SoftwareID,SoftwareName,SoftwareSystemType)
    VALUES (6,@SoftwareName,@SoftwareSystemType);
END
PRINT 'Session #2: FINISH';
SELECT *
FROM dbo.Software;
Results:
Session #1 results:
SoftwareID SoftwareName SoftwareSystemType
----------- ----------------- ------------------
1 Dynamics AX ERP
2 Dynamics NAV 2009 SCM
3 Dynamics CRM 2011 CRM
4 Dynamics CRM 2013 CRM
5 Dynamics CRM CRM
Session #2 results:
Session #2: INSERT
Session #2: FINISH
SoftwareID SoftwareName SoftwareSystemType
----------- ----------------- ------------------
1 Dynamics AX ERP <-- duplicate (row updated by session #1)
2 Dynamics NAV 2009 SCM
3 Dynamics CRM 2011 CRM
4 Dynamics CRM 2013 CRM
5 Dynamics CRM CRM
6 Dynamics AX ERP <-- duplicate (row inserted by session #2)
There is a great solution for this problem: you can use SQL's MERGE keyword.
Merge MyTargetTable hba
USING (SELECT Id = 8, Name = 'Product Listing Message') temp
ON temp.Id = hba.Id
WHEN NOT MATCHED THEN
INSERT (Id, Name) VALUES (temp.Id, temp.Name);
You can try it with the following sample:
IF OBJECT_ID ('dbo.TargetTable') IS NOT NULL
DROP TABLE dbo.TargetTable
GO
CREATE TABLE dbo.TargetTable
(
Id INT NOT NULL,
Name VARCHAR (255) NOT NULL,
CONSTRAINT PK_TargetTable PRIMARY KEY (Id)
)
GO
INSERT INTO dbo.TargetTable (Id, Name)
VALUES (1, 'Unknown')
GO
INSERT INTO dbo.TargetTable (Id, Name)
VALUES (2, 'Mapping')
GO
INSERT INTO dbo.TargetTable (Id, Name)
VALUES (3, 'Update')
GO
INSERT INTO dbo.TargetTable (Id, Name)
VALUES (4, 'Message')
GO
INSERT INTO dbo.TargetTable (Id, Name)
VALUES (5, 'Switch')
GO
INSERT INTO dbo.TargetTable (Id, Name)
VALUES (6, 'Unmatched')
GO
INSERT INTO dbo.TargetTable (Id, Name)
VALUES (7, 'ProductMessage')
GO
MERGE dbo.TargetTable hba
USING (SELECT Id = 8, Name = 'Listing Message') temp
ON temp.Id = hba.Id
WHEN NOT MATCHED THEN
    INSERT (Id, Name) VALUES (temp.Id, temp.Name);
More of a comment/link for suggested further reading: a really good blog article benchmarking various ways of accomplishing this task can be found here.
They compare a few techniques: "Insert Where Not Exists", the "Merge" statement, "Insert Except", and the typical "left join", to see which is the fastest way to accomplish this task.
The example code used for each technique is as follows (straight copy/paste from their page) :
INSERT INTO #table1 (Id, guidd, TimeAdded, ExtraData)
SELECT Id, guidd, TimeAdded, ExtraData
FROM #table2
WHERE NOT EXISTS (Select Id, guidd From #table1 WHERE #table1.id = #table2.id)
-----------------------------------
MERGE #table1 as [Target]
USING (select Id, guidd, TimeAdded, ExtraData from #table2) as [Source]
(id, guidd, TimeAdded, ExtraData)
on [Target].id =[Source].id
WHEN NOT MATCHED THEN
INSERT (id, guidd, TimeAdded, ExtraData)
VALUES ([Source].id, [Source].guidd, [Source].TimeAdded, [Source].ExtraData);
------------------------------
INSERT INTO #table1 (id, guidd, TimeAdded, ExtraData)
SELECT id, guidd, TimeAdded, ExtraData from #table2
EXCEPT
SELECT id, guidd, TimeAdded, ExtraData from #table1
------------------------------
INSERT INTO #table1 (id, guidd, TimeAdded, ExtraData)
SELECT #table2.id, #table2.guidd, #table2.TimeAdded, #table2.ExtraData
FROM #table2
LEFT JOIN #table1 on #table1.id = #table2.id
WHERE #table1.id is null
It's a good read for those who are looking for speed! On SQL 2014, the Insert-Except method turned out to be the fastest for 50 million or more records.
I know this post is old, but I found an original way to insert values into a table with the keywords INSERT INTO and EXISTS.
I say original because I did not find it on the Internet.
Here it is:
INSERT INTO targetTable(c1,c2)
select value1,value2
WHERE NOT EXISTS(select 1 from targetTable where c1=value1 and c2=value2 )
Isn't ignoring the duplicate via the unique constraint a solution?
INSERT IGNORE INTO tblSoftwareTitles...
Beginner's question here. I wish to populate a table with many rows of data straight from a query I'm running in the same session. I wish to do it using executemany(). Currently, I insert each row as a tuple, as shown in the script below.
Select Query to get the needed data:
This query returns data with 4 columns Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat
park_set_stat_query = "SET #row_number = 0;"
park_set_stat_query2 = "SET #row_number2 = 0;"
# one time load to catch only the changes done in the input table
park_change_stat_query = """select in1.Parking_ID,
in1.Snapshot_Date as Snapshot_Date,
in1.Snapshot_Time as Snapshot_Time,
in1.Parking_Stat
from (SELECT
Parking_ID,
Snapshot_Date,
Snapshot_Time,
Parking_Stat,
(@row_number:=@row_number + 1) AS num1
from Fact_Parking_Stat_Input
WHERE Parking_Stat<>0) as in1
left join (SELECT
Parking_ID,
Snapshot_Date,
Snapshot_Time,
Parking_Stat,
(@row_number2:=@row_number2 + 1)+1 AS num2
from Fact_Parking_Stat_Input
WHERE Parking_Stat<>0) as in2
on in1.Parking_ID=in2.Parking_ID and in1.num1=in2.num2
WHERE (CASE WHEN in1.Parking_Stat<>in2.Parking_Stat THEN 1 ELSE 0 END=1) OR num1=1"""
Here is the insert part of the script:
As you can see below, I insert each row into the destination table Fact_Parking_Stat_Input_Alter.
mycursor = connection.cursor()
mycursor2 = connection.cursor()
mycursor.execute(park_set_stat_query)
mycursor.execute(park_set_stat_query2)
mycursor.execute(park_change_stat_query)
# keep only changes in a staging table named Fact_Parking_Stat_Input_Alter
qSQLresults = mycursor.fetchall()
for row in qSQLresults:
    Parking_ID = row[0]
    Snapshot_Date = row[1]
    Snapshot_Time = row[2]
    Parking_Stat = row[3]
    # SQL query to INSERT a record into the table Fact_Parking_Stat_Input_Alter.
    mycursor2.execute('''INSERT into Fact_Parking_Stat_Input_Alter (Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat)
                         values (%s, %s, %s, %s)''',
                      (Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat))
# Commit your changes in the database
connection.commit()
mycursor.close()
mycursor2.close()
connection.close()
How can I improve the code so it inserts all the data in one insert command?
Thanks,
Amir
MySQL has an INSERT INTO ... SELECT statement that is probably far more efficient than querying the data in Python, pulling it down, and re-inserting it:
https://www.mysqltutorial.org/mysql-insert-into-select/
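For example, wrapping the question's SELECT in an INSERT ... SELECT keeps everything on the server. A sketch only, reusing the connection object and table names from the question:

with connection.cursor() as cursor:
    cursor.execute("SET @row_number = 0;")
    cursor.execute("SET @row_number2 = 0;")
    # Same query as park_change_stat_query, wrapped in INSERT ... SELECT:
    # the server copies the changed rows itself, in one round trip.
    cursor.execute("""
        INSERT INTO Fact_Parking_Stat_Input_Alter
            (Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat)
        SELECT in1.Parking_ID, in1.Snapshot_Date, in1.Snapshot_Time, in1.Parking_Stat
        FROM (SELECT Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat,
                     (@row_number := @row_number + 1) AS num1
              FROM Fact_Parking_Stat_Input
              WHERE Parking_Stat <> 0) AS in1
        LEFT JOIN (SELECT Parking_ID, Parking_Stat,
                          (@row_number2 := @row_number2 + 1) + 1 AS num2
                   FROM Fact_Parking_Stat_Input
                   WHERE Parking_Stat <> 0) AS in2
          ON in1.Parking_ID = in2.Parking_ID AND in1.num1 = in2.num2
        WHERE (CASE WHEN in1.Parking_Stat <> in2.Parking_Stat THEN 1 ELSE 0 END = 1)
           OR in1.num1 = 1;
    """)
connection.commit()

If you do need the rows in Python anyway, cursor.executemany('INSERT INTO Fact_Parking_Stat_Input_Alter (Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat) VALUES (%s, %s, %s, %s)', qSQLresults) at least batches the inserts into far fewer round trips than one execute per row.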
What's the best / fastest solution for the following task:
Used technology: MySQL database + Python
I'm downloading a data.sql file. It's format:
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
...
Values in each row differ.
Tables structures: http://sqlfiddle.com/#!9/8f10d6
A person can have multiple cities.
A person can be in only one group, or in no group at all.
A group can have multiple persons.
And I know which country the .sql data come from.
I need to split these data into 3 tables, updating rows that already exist and inserting new rows where they don't.
So I came up with 2 solutions:
Split the values from the file via Python, then for each line perform 3 selects + 3 updates/inserts in a transaction.
Somehow bulk insert the data into a temporary table and then manipulate the data inside the database: for each row in the temporary table, perform 3 select queries (one against each actual table); if I find the row I run an update, and if not I run an insert.
I will be running this function multiple times per day, with over 10K lines in the .sql file, and it will be updating or creating over 30K rows in the database.
//EDIT
My inserting / updating code now:
autocomit = "SET autocommit=0"
with connection.cursor() as cursor:
cursor.execute(autocomit)
data = data.sql
lines = data.splitlines
for line in lines:
with connection.cursor() as cursor:
cursor.execute(line)
temp_data = "SELECT * FROM temp_table"
with connection.cursor() as cursor:
cursor.execute(temp_data)
temp_data = cursor.fetchall()
for temp_row in temp_data:
group_id = temp_row[0]
city_id = temp_row[1]
zip_code = temp_row[2]
post_code = temp_row[3]
earnings = temp_row[4]
group_name = temp_row[5]
votes = temp_row[6]
city_name = temp_row[7]
person_id = temp_row[8]
person_name = temp_row[9]
networth = temp_row[10]
group_select = "SELECT * FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
group_values = (group_id, countryid)
with connection.cursor() as cursor:
row = cursor.execute(group_select, group_values)
if row == 0 and group_id != 0: #If person doesn't have group do not create
group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s)"
group_insert_values = (group_id, group_name, countryid)
with connection.cursor() as cursor:
cursor.execute(group_insert, group_insert_values)
groupid = cursor.lastrowid
elif row == 1 and group_id != 0:
group_update = "UPDATE perm_group SET group_name = group_name WHERE group_id = %s and countryid_fk = %s"
group_update_values = (group_id, countryid)
with connection.cursor() as cursor:
cursor.execute(group_update, group_update_values)
#Select group id for current row to assign correct group to the person
group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s and countryid_fk = %s"
group_certain_select_values = (group_id, countryid)
with connection.cursor() as cursor:
cursor.execute(group_certain_select, group_certain_select_values)
groupid = cursor.fetchone()
#.
#.
#.
#Repeating the same piece of code for person and city
Measured time: 206 seconds - which is not acceptable.
group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s) ON DUPLICATE KEY UPDATE group_id = %s, group_name = %s"
group_insert_values = (group_id, group_name, countryid, group_id, group_name)
with connection.cursor() as cursor:
cursor.execute(group_insert, group_insert_values)
#Select group id for current row to assign correct group to the person
group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s and countryid_fk = %s"
group_certain_select_values = (group_id, countryid)
with connection.cursor() as cursor:
cursor.execute(group_certain_select, group_certain_select_values)
groupid = cursor.fetchone()
Measured time: from 30 to 50 seconds. (Still quite long, but it's getting better)
Are there any other better (faster) options for how to do it?
Thanks in advance,
popcorn
I would recommend that you load the data into a staging table and do the processing in SQL.
Basically, your ultimate result is a set of SQL tables, so SQL is necessarily going to be part of the solution. You might as well put as much logic into the database as you can, to reduce the number of tools needed.
Loading 10,000 rows should not take much time. However, if you have a choice of data formats, I would recommend a CSV file over INSERT statements: the inserts incur extra overhead, if only because they are larger.
Once the data is in the database, I would not worry much about the processing time for storing the data in three tables.
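For illustration, a sketch of that staging approach, assuming pymysql as the driver (the question doesn't name one) and a hypothetical data.csv produced from the .sql file; LOAD DATA does the bulk load and one set-based upsert per target table does the processing:

import pymysql  # assumption: any DB-API MySQL driver with local_infile support is similar

countryid = 1  # hypothetical: the country the downloaded file belongs to

connection = pymysql.connect(host='localhost', user='XXXX', password='XXXX',
                             database='XXXX', local_infile=True)
try:
    with connection.cursor() as cursor:
        # 1. Bulk-load the file into the staging table in one statement.
        #    Requires local_infile to be enabled on the server as well.
        cursor.execute("""
            LOAD DATA LOCAL INFILE 'data.csv'
            INTO TABLE temp_table
            FIELDS TERMINATED BY ','
            LINES TERMINATED BY '\\n'
        """)
        # 2. Process inside the database: one set-based upsert per target
        #    table replaces the per-row SELECT/UPDATE/INSERT round trips.
        cursor.execute("""
            INSERT INTO perm_group (group_id, group_name, countryid_fk)
            SELECT DISTINCT group_id, group_name, %s
            FROM temp_table
            WHERE group_id <> 0
            ON DUPLICATE KEY UPDATE group_name = VALUES(group_name)
        """, (countryid,))
        # ...same pattern for the person and city tables...
    connection.commit()
finally:
    connection.close()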
How to retrieve inserted id after inserting row in SQLite using Python? I have table like this:
id INT AUTOINCREMENT PRIMARY KEY,
username VARCHAR(50),
password VARCHAR(50)
I insert a new row with example data username="test" and password="test". How do I retrieve the generated id in a transaction-safe way? This is for a website solution, where two people may be inserting data at the same time. I know I can read back the last row, but I don't think that is transaction safe. Can somebody give me some advice?
You could use cursor.lastrowid (see "Optional DB API Extensions"):
connection=sqlite3.connect(':memory:')
cursor=connection.cursor()
cursor.execute('''CREATE TABLE foo (id integer primary key autoincrement ,
username varchar(50),
password varchar(50))''')
cursor.execute('INSERT INTO foo (username,password) VALUES (?,?)',
('test','test'))
print(cursor.lastrowid)
# 1
If two people are inserting at the same time, as long as they are using different cursors, cursor.lastrowid will return the id for the last row that cursor inserted:
cursor.execute('INSERT INTO foo (username,password) VALUES (?,?)',
('blah','blah'))
cursor2=connection.cursor()
cursor2.execute('INSERT INTO foo (username,password) VALUES (?,?)',
('blah','blah'))
print(cursor2.lastrowid)
# 3
print(cursor.lastrowid)
# 2
cursor.execute('INSERT INTO foo (id,username,password) VALUES (?,?,?)',
(100,'blah','blah'))
print(cursor.lastrowid)
# 100
Note that lastrowid returns None when you insert more than one row at a time with executemany:
cursor.executemany('INSERT INTO foo (username,password) VALUES (?,?)',
(('baz','bar'),('bing','bop')))
print(cursor.lastrowid)
# None
All credits to @Martijn Pieters in the comments:
You can use the function last_insert_rowid():
The last_insert_rowid() function returns the ROWID of the last row insert from the database connection which invoked the function. The last_insert_rowid() SQL function is a wrapper around the sqlite3_last_insert_rowid() C/C++ interface function.
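From Python this is just one more query on the same connection, for example:

import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('''CREATE TABLE foo (id integer primary key autoincrement,
                                    username varchar(50),
                                    password varchar(50))''')
cursor.execute('INSERT INTO foo (username,password) VALUES (?,?)',
               ('test', 'test'))
# last_insert_rowid() is tracked per connection, so inserts made on other
# connections cannot interfere with the value seen here.
cursor.execute('SELECT last_insert_rowid()')
print(cursor.fetchone()[0])
# 1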
SQLite 3.35's RETURNING clause:
CREATE TABLE users (
id INTEGER PRIMARY KEY,
first_name TEXT,
last_name TEXT
);
INSERT INTO users (first_name, last_name)
VALUES ('Jane', 'Doe')
RETURNING id;
returns requested columns of the inserted row in INSERT, UPDATE and DELETE statements. Python usage:
cursor.execute('INSERT INTO users (first_name, last_name) VALUES (?,?)'
' RETURNING id',
('Jane', 'Doe'))
row = cursor.fetchone()
inserted_id = row[0] if row else None
I would like to remove duplicate rows, but only when three columns (name, price and new price) all match. And this should happen in another Python script: the data can be inserted by one script, and a second script, run as a cron job, deletes the duplicates.
So in this case:
cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")
cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")
are duplicates. Example script with inserted data:
import psycopg2
import sys

con = None
try:
    con = psycopg2.connect(database='testdb', user='janbodnar')
    cur = con.cursor()
    cur.execute("CREATE TABLE cars(id INT PRIMARY KEY, name VARCHAR(20), price INT, new price INT)")
    cur.execute("INSERT INTO cars VALUES(1,'Audi',52642, 98484)")
    cur.execute("INSERT INTO cars VALUES(2,'Mercedes',57127, 874897)")
    cur.execute("INSERT INTO cars VALUES(3,'Skoda',9000, 439788)")
    cur.execute("INSERT INTO cars VALUES(4,'Volvo',29000, 743878)")
    cur.execute("INSERT INTO cars VALUES(5,'Bentley',350000, 434684)")
    cur.execute("INSERT INTO cars VALUES(6,'Citroen',21000, 43874)")
    cur.execute("INSERT INTO cars VALUES(7,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")
    cur.execute("INSERT INTO cars VALUES(10,'Volkswagen',21600, 36456)")
    con.commit()
except psycopg2.DatabaseError, e:
    if con:
        con.rollback()
    print 'Error %s' % e
    sys.exit(1)
finally:
    if con:
        con.close()
You can do this in one statement without additional round-trips to the server.
DELETE FROM cars
USING (
SELECT id, row_number() OVER (PARTITION BY name, price, new_price
ORDER BY id) AS rn
FROM cars
) x
WHERE cars.id = x.id
AND x.rn > 1;
Requires PostgreSQL 8.4 or later for the window function row_number().
Out of a set of dupes the smallest id survives.
Note that I changed "new price" to new_price.
Or use the EXISTS semi-join that @wildplasser posted as a comment, to the same effect.
Or, by special request of CTE devotee @wildplasser, with a CTE instead of the subquery ... :)
WITH x AS (
SELECT id, row_number() OVER (PARTITION BY name, price, new_price
ORDER BY id) AS rn
FROM cars
)
DELETE FROM cars
USING x
WHERE cars.id = x.id
AND x.rn > 1;
Data modifying CTE requires Postgres 9.1 or later.
This form will perform about the same as the one with the subquery.
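Since the question mentions running the cleanup from a cron job, a minimal standalone script (in the same Python 2 style and with the same connection settings as the question) could look like this:

#!/usr/bin/env python
import psycopg2
import sys

con = None
try:
    con = psycopg2.connect(database='testdb', user='janbodnar')
    cur = con.cursor()
    cur.execute("""
        DELETE FROM cars
        USING (
            SELECT id, row_number() OVER (PARTITION BY name, price, new_price
                                          ORDER BY id) AS rn
            FROM cars
        ) x
        WHERE cars.id = x.id
        AND   x.rn > 1""")
    con.commit()
    print 'deleted %d duplicate rows' % cur.rowcount
except psycopg2.DatabaseError, e:
    if con:
        con.rollback()
    print 'Error %s' % e
    sys.exit(1)
finally:
    if con:
        con.close()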
Use a GROUP BY SQL statement to identify the rows, together with the initial primary key:
duplicate_query = '''\
SELECT MIN(id), "name", price, "new price"
FROM cars
GROUP BY "name", price, "new price"
HAVING COUNT(ID) > 1
'''
The above query selects the lowest primary key id for each group of (name, price, "new price") rows where there is more than one primary key id. For your sample data, this will return:
7, 'Hummer', 41400, 49747
9, 'Volkswagen', 21600, 36456
You can then use the returned data to delete the duplicates:
delete_dupes = '''
DELETE
FROM cars
WHERE
"name"=%(name)s AND price=%(price)s AND "new price"=%(newprice)s AND
id > %(id)s
'''
cur.execute(duplicate_query)
dupes = cur.fetchall()
cur.executemany(delete_dupes, [
dict(name=r[1], price=r[2], newprice=r[3], id=r[0])
for r in dupes])
Note that we delete any row where the primary key id is larger than the first id with the same 3 columns. For the first dupe, only the row with id 8 will match, for the second dupe the row with id 10 matches.
This does do a separate delete for each dupe found. You can combine this into one statement with a WHERE EXISTS sub-select query:
delete_dupes = '''\
DELETE FROM cars cdel
WHERE EXISTS (
SELECT *
FROM cars cex
WHERE
cex."name" = cdel."name" AND
cex.price = cdel.price AND
cex."new price" = cdel."new price" AND
cex.id > cdel.id
)
'''
cur.execute(delete_dupes)
This instructs PostgreSQL to delete any row for which there are other rows with the same name, price and new price but with a primary key that is higher than the current row.